What defines a game’s success?

Introduction and project description

The purpose of this research is to explore the factors that define a game’s success, examining genre, price, in-app purchases, the game’s description, the languages it is offered in, its developer, and the age range of the audience permitted to play it. By analyzing these attributes, we can better understand why some app games succeed over others. The dataset we are analyzing, entitled “17K Mobile Strategy Games,” consists of all strategy games from the Apple App Store. We hypothesize that games that are free, with a wide audience and an eye-catching description, will draw more users and lead to a game’s success/popularity, where success is defined as a game with both a high user count and high user ratings. We assume that free games attract a higher user count, giving those games a better chance of succeeding. The results of this research may help game developers prioritize the important attributes discovered here and aid their games’ success and popularity worldwide.

Throughout the course of this project we will be answering the following questions:

Data exploration and visualization

1. Which genre is the most popular?
2. Which words are most commonly used in the Description of games?
3. Does the genre of a game cause people to spend more money?
4. What are the top languages in which games are offered?
5. What is the distribution of user rating across genre?
6. Which genre of game does better internationally?
7. What is the relationship between the initial price of apps and average user rating?
8. What is the average price of in-app purchases?
9. Is there a relationship between user rating and in-app purchases? And does the amount of available in-app purchases decrease rating?
10. What information can we find about game developers and their strategy games?
11. What is the frequency of the age groups?
12. How has the size of the applications of the top 3 primary genres changed over a span of about 11 years?

Data analysis, modeling and/or predictions

13. What contributes to a game’s success?
14. Can we predict if an app is free or not?
15. What primary genre is similar to the “Games” genre?

——————————————————————

To start this analysis we first want to clean the original dataset:

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## -- Column specification --------------------------------------------------------
## cols(
##   URL = col_character(),
##   ID = col_double(),
##   Name = col_character(),
##   Subtitle = col_character(),
##   `Icon URL` = col_character(),
##   `Average User Rating` = col_double(),
##   `User Rating Count` = col_double(),
##   Price = col_double(),
##   `In-app Purchases` = col_character(),
##   Description = col_character(),
##   Developer = col_character(),
##   `Age Rating` = col_character(),
##   Languages = col_character(),
##   Size = col_double(),
##   `Primary Genre` = col_character(),
##   Genres = col_character(),
##   `Original Release Date` = col_character(),
##   `Current Version Release Date` = col_character()
## )
## # A tibble: 6 x 15
##   Name    `Icon URL`    `Average User R~ `User Rating Co~ Price `In-app Purchas~
##   <chr>   <chr>                    <dbl>            <dbl> <dbl> <chr>           
## 1 Sudoku  https://is2-~              4               3553  2.99 <NA>            
## 2 Reversi https://is4-~              3.5              284  1.99 <NA>            
## 3 Morocco https://is5-~              3               8376  0    <NA>            
## 4 Sudoku~ https://is3-~              3.5           190394  0    <NA>            
## 5 Senet ~ https://is1-~              3.5               28  2.99 <NA>            
## 6 Sudoku~ https://is1-~              3                 47  0    1.99            
## # ... with 9 more variables: Description <chr>, Developer <chr>,
## #   Age Rating <chr>, Languages <chr>, Size <dbl>, Primary Genre <chr>,
## #   Genres <chr>, Original Release Date <date>,
## #   Current Version Release Date <date>

Separating Data and Renaming Variables:

## # A tibble: 7,488 x 4
## # Groups:   "Price", "Average_User_Rating" [1]
##    Price   AUR `"Price"` `"Average_User_Rating"`
##    <dbl> <dbl> <chr>     <chr>                  
##  1  2.99   4   Price     Average_User_Rating    
##  2  1.99   3.5 Price     Average_User_Rating    
##  3  0      3   Price     Average_User_Rating    
##  4  0      3.5 Price     Average_User_Rating    
##  5  2.99   3.5 Price     Average_User_Rating    
##  6  0      3   Price     Average_User_Rating    
##  7  0      2.5 Price     Average_User_Rating    
##  8  0.99   2.5 Price     Average_User_Rating    
##  9  0      2.5 Price     Average_User_Rating    
## 10  0      2.5 Price     Average_User_Rating    
## # ... with 7,478 more rows
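The renaming itself can be done with `rename()`; a sketch (the short name `AUR` matches the tables above):

```r
library(dplyr)

df <- tibble::tibble(Price = c(2.99, 1.99, 0),
                     `Average User Rating` = c(4, 3.5, 3))

# Shorter column name for later plots and summaries
renamed <- df %>%
  rename(AUR = `Average User Rating`)

names(renamed)
```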

——————————————————————

Data exploration and visualization

——————————————————————

2. Does genre of games cause people to spend more money?

After exploring which genre is most popular among users, we examined whether the genre of a game had any influence on the amount of money spent for the app (purchase price) or in the app (in-app purchases).

Cleaning and separating data

## # A tibble: 7,488 x 5
## # Groups:   "Genres" [1]
##    Name                      Price InApp Genres                       `"Genres"`
##    <chr>                     <dbl> <chr> <chr>                        <chr>     
##  1 "Sudoku"                   2.99 <NA>  Games, Strategy, Puzzle      Genres    
##  2 "Reversi"                  1.99 <NA>  Games, Strategy, Board       Genres    
##  3 "Morocco"                  0    <NA>  Games, Board, Strategy       Genres    
##  4 "Sudoku (Free)"            0    <NA>  Games, Strategy, Puzzle      Genres    
##  5 "Senet Deluxe"             2.99 <NA>  Games, Strategy, Board, Edu~ Genres    
##  6 "Sudoku - Classic number~  0    1.99  Games, Entertainment, Strat~ Genres    
##  7 "Gravitation"              0    <NA>  Games, Entertainment, Puzzl~ Genres    
##  8 "Colony"                   0.99 <NA>  Games, Strategy, Board       Genres    
##  9 "Carte"                    0    <NA>  Games, Strategy, Board, Ent~ Genres    
## 10 "\"Barrels O' Fun\""       0    <NA>  Games, Casual, Strategy      Genres    
## # ... with 7,478 more rows

Because the genre and in-app purchases columns contain comma-separated values, the separate_rows() function is utilized to split each individual element onto its own row.
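For example, a row whose Genres value is “Games, Strategy, Puzzle” becomes three rows:

```r
library(dplyr)
library(tidyr)

games <- tibble::tibble(
  Name   = c("Sudoku", "Reversi"),
  Genres = c("Games, Strategy, Puzzle", "Games, Strategy, Board")
)

# One genre per row; the separator ", " matches how the column prints
long <- games %>%
  separate_rows(Genres, sep = ", ")

long
```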

## # A tibble: 48 x 3
##    Genres     avgPrice avgInApp
##    <chr>         <dbl>    <dbl>
##  1 News         0.0762     23.9
##  2 Networking   0.0490     21.0
##  3 Social       0.0490     21.0
##  4 Medical      0.707      20.0
##  5 Business     0.318      17.6
##  6 Playing      0.250      17.6
##  7 Role         0.250      17.6
##  8 Card         0.339      12.8
##  9 Action       0.285      12.0
## 10 Simulation   0.382      11.9
## # ... with 38 more rows
## # A tibble: 48 x 4
##    Genres       avgPrice avgInApp totalavg
##    <chr>           <dbl>    <dbl>    <dbl>
##  1 Weather         9.99    NaN       9.99 
##  2 Finance         3.97     11.3    15.3  
##  3 Reference       3.65      5.74    9.38 
##  4 Board           1.04      6.28    7.32 
##  5 Education       0.714     5.31    6.02 
##  6 Medical         0.707    20.0    20.7  
##  7 Productivity    0.638     9.88   10.5  
##  8 Emoji           0.495   NaN       0.495
##  9 Expressions     0.495   NaN       0.495
## 10 Utilities       0.389     3.08    3.47 
## # ... with 38 more rows
## # A tibble: 48 x 4
##    Genres     avgPrice avgInApp totalavg
##    <chr>         <dbl>    <dbl>    <dbl>
##  1 News         0.0762     23.9     23.9
##  2 Networking   0.0490     21.0     21.1
##  3 Social       0.0490     21.0     21.1
##  4 Medical      0.707      20.0     20.7
##  5 Business     0.318      17.6     18.0
##  6 Playing      0.250      17.6     17.9
##  7 Role         0.250      17.6     17.9
##  8 Finance      3.97       11.3     15.3
##  9 Card         0.339      12.8     13.1
## 10 Simulation   0.382      11.9     12.3
## # ... with 38 more rows

Once the values are separated into new rows, the summarize() function is employed to calculate the average purchase price of the app itself, the average amount spent on in-app purchases, and the total average amount spent on both combined.
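A sketch of that aggregation (with `na.rm = TRUE`, a genre with no in-app purchases at all yields `NaN` for `avgInApp`, which is why `NaN` appears in the tables above):

```r
library(dplyr)

df <- tibble::tibble(
  Genres = c("Board", "Board", "Weather"),
  Price  = c(1.99, 0.09, 9.99),
  InApp  = c(4.99, NA, NA)  # Weather has no in-app purchases
)

by_genre <- df %>%
  group_by(Genres) %>%
  summarize(avgPrice = mean(Price, na.rm = TRUE),
            avgInApp = mean(InApp, na.rm = TRUE),  # NaN when all values are NA
            .groups = "drop") %>%
  mutate(totalavg = avgPrice + ifelse(is.nan(avgInApp), 0, avgInApp))

by_genre
```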

From visual inspection of the previous tables, based on average upfront cost, the Weather genre leads, with people willing to spend an average of $9.99 up front. For the News genre, by contrast, little upfront cost is paid, but average in-app purchases come to $23.87. The top 3 genres that caused people to spend the most money in total are News, Networking, and Social.

——————————————————————

3. Which words are most commonly used in the Description of games?

A game’s description is just as important as the hook in an essay. Just as a hook draws in your audience, the description attracts users to your game, which is key to its success. If no one is finding your game, then your description has not adequately captivated your audience. So, in order to determine which words were most often used to describe games, we split each game’s description into individual words and found the frequency of each word, removing articles and other non-descriptive words such as “a,” “the,” “an,” “it,” “this,” and “be.” Looking primarily for adjectives, words that could describe what made a game different or special compared to others, the word cloud to the left depicts some of the top words used in game descriptions.
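A sketch of the tokenize-and-filter step using tidytext (the `Joining, by = "word"` message in the output comes from `anti_join()` against tidytext’s built-in `stop_words` table):

```r
library(dplyr)
library(tidytext)  # unnest_tokens(), stop_words

desc <- tibble::tibble(
  Name        = "Sudoku",
  Description = "Join over a million fans and play the best sudoku game"
)

words <- desc %>%
  unnest_tokens(word, Description) %>%  # one lowercase word per row
  anti_join(stop_words, by = "word")    # drop "a", "the", "and", ...

count(words, word, sort = TRUE)
```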

Cleaning Original Data

## # A tibble: 6 x 15
##   Name    `Icon URL`    `Average User R~ `User Rating Co~ Price `In-app Purchas~
##   <chr>   <chr>                    <dbl>            <dbl> <dbl> <chr>           
## 1 Sudoku  https://is2-~              4               3553  2.99 <NA>            
## 2 Reversi https://is4-~              3.5              284  1.99 <NA>            
## 3 Morocco https://is5-~              3               8376  0    <NA>            
## 4 Sudoku~ https://is3-~              3.5           190394  0    <NA>            
## 5 Senet ~ https://is1-~              3.5               28  2.99 <NA>            
## 6 Sudoku~ https://is1-~              3                 47  0    1.99            
## # ... with 9 more variables: Description <chr>, Developer <chr>,
## #   Age Rating <chr>, Languages <chr>, Size <dbl>, Primary Genre <chr>,
## #   Genres <chr>, Original Release Date <date>,
## #   Current Version Release Date <date>

Cleaning Description and Creating Wordcloud

## # A tibble: 6 x 2
##   Name   word 
##   <chr>  <chr>
## 1 Sudoku join 
## 2 Sudoku over 
## 3 Sudoku of   
## 4 Sudoku our  
## 5 Sudoku fans 
## 6 Sudoku and
## Joining, by = "word"
## # A tibble: 6 x 2
##   Name   word    
##   <chr>  <chr>   
## 1 Sudoku join    
## 2 Sudoku fans    
## 3 Sudoku download
## 4 Sudoku one     
## 5 Sudoku sudoku  
## 6 Sudoku game
## # A tibble: 10 x 2
##    word         n
##    <chr>    <int>
##  1 game     21505
##  2 play      6732
##  3 new       5678
##  4 world     4064
##  5 players   3616
##  6 strategy  3457
##  7 time      3434
##  8 free      3410
##  9 battle    3356
## 10 levels    3099

Looking at the word cloud, we can see that “game” is used most often, followed by “play,” “new,” “world,” “players,” and “levels.”

Top 10 Game Descriptors, excluding ambiguous phrases/words and repeated plural versions of the same word:

  1. game
  2. play
  3. new
  4. world
  5. players
  6. strategy
  7. time
  8. free
  9. battle
  10. levels

——————————————————————

4. What are the top languages in which games are offered?

Cleaning Original Data

## # A tibble: 6 x 15
##   Name    `Icon URL`    `Average User R~ `User Rating Co~ Price `In-app Purchas~
##   <chr>   <chr>                    <dbl>            <dbl> <dbl> <chr>           
## 1 Sudoku  https://is2-~              4               3553  2.99 <NA>            
## 2 Reversi https://is4-~              3.5              284  1.99 <NA>            
## 3 Morocco https://is5-~              3               8376  0    <NA>            
## 4 Sudoku~ https://is3-~              3.5           190394  0    <NA>            
## 5 Senet ~ https://is1-~              3.5               28  2.99 <NA>            
## 6 Sudoku~ https://is1-~              3                 47  0    1.99            
## # ... with 9 more variables: Description <chr>, Developer <chr>,
## #   Age Rating <chr>, Languages <chr>, Size <dbl>, Primary Genre <chr>,
## #   Genres <chr>, Original Release Date <date>,
## #   Current Version Release Date <date>

Frequency Bar Chart of Languages

Cleaning the Language Column

## # A tibble: 6 x 2
##   Name   Languages
##   <chr>  <chr>    
## 1 Sudoku DA       
## 2 Sudoku NL       
## 3 Sudoku EN       
## 4 Sudoku FI       
## 5 Sudoku FR       
## 6 Sudoku DE

Finding Frequency of Top 15 Languages

## # A tibble: 6 x 3
##   Languages total full_lang
##   <chr>     <int> <lgl>    
## 1 EN         7429 NA       
## 2 DE         1573 NA       
## 3 ZH         1548 NA       
## 4 FR         1519 NA       
## 5 ES         1473 NA       
## 6 JA         1354 NA

Import Full Lang, Clean Dataset, Create Loop to Put Full Name Instead of Abbrev

## 
## -- Column specification --------------------------------------------------------
## cols(
##   alpha2 = col_character(),
##   English = col_character()
## )
## # A tibble: 10 x 2
##    alpha2 English                                                               
##    <chr>  <chr>                                                                 
##  1 br     Breton                                                                
##  2 bs     Bosnian                                                               
##  3 ca     Catalan; Valencian                                                    
##  4 ce     Chechen                                                               
##  5 ch     Chamorro                                                              
##  6 co     Corsican                                                              
##  7 cr     Cree                                                                  
##  8 cs     Czech                                                                 
##  9 cu     Church Slavic; Old Slavonic; Church Slavonic; Old Bulgarian; Old Chur~
## 10 cv     Chuvash
## # A tibble: 10 x 2
##    alpha2 English      
##    <chr>  <chr>        
##  1 BR     Breton       
##  2 BS     Bosnian      
##  3 CA     Catalan      
##  4 CE     Chechen      
##  5 CH     Chamorro     
##  6 CO     Corsican     
##  7 CR     Cree         
##  8 CS     Czech        
##  9 CU     Church Slavic
## 10 CV     Chuvash

Bar Plot of Top 15 Languages

We wanted to explore which languages occurred most often in applications. As expected, the most popular application language is English, followed by German, Chinese, and French. This is most likely because most of the audience on the Apple App Store speaks English, so most apps include the language.
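Rather than an explicit loop, a named lookup vector gives the same abbreviation-to-name mapping in one step (a sketch using a few of the codes from the tables above):

```r
# Named vector: names are ISO 639-1 codes, values are English language names
full_lang <- c(EN = "English", DE = "German", ZH = "Chinese", FR = "French")

codes <- c("EN", "DE", "ZH", "FR", "DE")

# Vectorized lookup replaces every abbreviation at once
unname(full_lang[codes])
```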

——————————————————————

5. What is the distribution of user rating across genre?

Cleaning Original Data

## # A tibble: 6 x 15
##   Name    `Icon URL`    `Average User R~ `User Rating Co~ Price `In-app Purchas~
##   <chr>   <chr>                    <dbl>            <dbl> <dbl> <chr>           
## 1 Sudoku  https://is2-~              4               3553  2.99 <NA>            
## 2 Reversi https://is4-~              3.5              284  1.99 <NA>            
## 3 Morocco https://is5-~              3               8376  0    <NA>            
## 4 Sudoku~ https://is3-~              3.5           190394  0    <NA>            
## 5 Senet ~ https://is1-~              3.5               28  2.99 <NA>            
## 6 Sudoku~ https://is1-~              3                 47  0    1.99            
## # ... with 9 more variables: Description <chr>, Developer <chr>,
## #   Age Rating <chr>, Languages <chr>, Size <dbl>, Primary Genre <chr>,
## #   Genres <chr>, Original Release Date <date>,
## #   Current Version Release Date <date>

Violin of Average User Rating Across Genre

## [1] "Games"         "Entertainment" "Education"     "Utilities"    
## [5] "Sports"        "Reference"
## # A tibble: 6 x 2
##   `Average User Rating` `Primary Genre`
##                   <dbl> <chr>          
## 1                   4   Games          
## 2                   3.5 Games          
## 3                   3   Games          
## 4                   3.5 Games          
## 5                   3.5 Games          
## 6                   3   Games

Next we wanted to look at the average user rating across different primary genres. From the violin plot, you can see that there is a lot of variability in each primary genre, with the exception of the Book genre. A possible reason the Book genre has few outliers is that there is not as much data as in, say, the Games genre.

The graph also shows that the genres on the left half have no single high concentration of user ratings; ratings are spread out, in contrast to the Games genre, where you can clearly see a higher concentration of ratings around 4.5.
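The violin plot itself can be sketched as follows (toy data standing in for the cleaned `Average User Rating` / `Primary Genre` columns shown above):

```r
library(ggplot2)

ratings <- data.frame(
  rating = c(4, 3.5, 4.5, 4.5, 3, 2, 5, 3.5),
  genre  = rep(c("Games", "Books"), each = 4)
)

p <- ggplot(ratings, aes(x = genre, y = rating)) +
  geom_violin() +  # one density shape per primary genre
  labs(title = "Average User Rating Across Primary Genre",
       x = "Primary Genre", y = "Average User Rating")

p
```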

——————————————————————

6. Which genre of game does better internationally?

After identifying the top languages in which games are offered, we then decided to delve into which genre of games did better internationally.

Cleaning and separating data

## # A tibble: 7,488 x 4
## # Groups:   "Price", "Average_User_Rating" [1]
##    Price   AUR `"Price"` `"Average_User_Rating"`
##    <dbl> <dbl> <chr>     <chr>                  
##  1  2.99   4   Price     Average_User_Rating    
##  2  1.99   3.5 Price     Average_User_Rating    
##  3  0      3   Price     Average_User_Rating    
##  4  0      3.5 Price     Average_User_Rating    
##  5  2.99   3.5 Price     Average_User_Rating    
##  6  0      3   Price     Average_User_Rating    
##  7  0      2.5 Price     Average_User_Rating    
##  8  0.99   2.5 Price     Average_User_Rating    
##  9  0      2.5 Price     Average_User_Rating    
## 10  0      2.5 Price     Average_User_Rating    
## # ... with 7,478 more rows

The separate_rows() function is applied to the languages and genres columns in order to place each individual comma-separated language and genre into its own row.

## # A tibble: 6 x 3
##   Name   Languages Genres  
##   <chr>  <chr>     <chr>   
## 1 Sudoku DA        Games   
## 2 Sudoku DA        Strategy
## 3 Sudoku DA        Puzzle  
## 4 Sudoku NL        Games   
## 5 Sudoku NL        Strategy
## 6 Sudoku NL        Puzzle

Since we only want to look at international languages, English is excluded from this dataset, and the data frame is grouped by language and genre, summarizing the total count of each language/genre pair.
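A sketch of that grouping step (`count()` is shorthand for `group_by()` followed by `summarise(n())`):

```r
library(dplyr)

pairs <- tibble::tibble(
  Languages = c("EN", "ZH", "ZH", "DE", "ZH"),
  Genres    = c("Games", "Games", "Games", "Strategy", "Strategy")
)

intl <- pairs %>%
  filter(Languages != "EN") %>%               # international languages only
  count(Languages, Genres, name = "total") %>%
  arrange(desc(total))

intl
```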

## `summarise()` has grouped output by 'Languages'. You can override using the `.groups` argument.
## # A tibble: 1,389 x 3
## # Groups:   Languages [112]
##    Languages Genres        total
##    <chr>     <chr>         <int>
##  1 ZH        Games          2712
##  2 ZH        Strategy       2712
##  3 DE        Games          1573
##  4 DE        Strategy       1573
##  5 FR        Games          1519
##  6 FR        Strategy       1519
##  7 ES        Games          1473
##  8 ES        Strategy       1473
##  9 ZH        Entertainment  1408
## 10 JA        Games          1354
## # ... with 1,379 more rows

Based on this table, the most popular genres internationally are Games and Strategy in ZH (Chinese), with DE (German) and FR (French) coming in second and third place, also favoring the Games and Strategy genres.

——————————————————————

7. What is the relationship between initial price of apps and average user rating?

Next, we wanted to look at the relationship between different age ratings and their user rating across primary genres. Where columns are missing, as in Finance, it simply means those apps are generally available to all ages.

Looking at the Books graph, you can see that book applications rated for teens and up have a higher rating than those rated for children. The Games genre is relatively similar across all age ratings, dropping off slightly for 17+ games.

What was most interesting was that social networking apps that allow children 4+ to use the application were rated very low. The ratings could come from upset parents frustrated that their child is messaging someone online. Companies could look at this and set age restrictions to prevent younger children from going onto these social networking apps, and perhaps their ratings would increase.

Cleaning and separating data

## # A tibble: 7,488 x 4
## # Groups:   "Price", "Average_User_Rating" [1]
##    Price   AUR `"Price"` `"Average_User_Rating"`
##    <dbl> <dbl> <chr>     <chr>                  
##  1  2.99   4   Price     Average_User_Rating    
##  2  1.99   3.5 Price     Average_User_Rating    
##  3  0      3   Price     Average_User_Rating    
##  4  0      3.5 Price     Average_User_Rating    
##  5  2.99   3.5 Price     Average_User_Rating    
##  6  0      3   Price     Average_User_Rating    
##  7  0      2.5 Price     Average_User_Rating    
##  8  0.99   2.5 Price     Average_User_Rating    
##  9  0      2.5 Price     Average_User_Rating    
## 10  0      2.5 Price     Average_User_Rating    
## # ... with 7,478 more rows
library(gridExtra) # for the grid.arrange() function

G1 <- ggplot(data = clean_data) +
  geom_bar(mapping = aes(x = Price)) +
  coord_cartesian(xlim = c(0, 20)) +
  labs(title = "Overall Price",                       # change title
       x = "Prices (excluding prices over $20)")      # change x lab

G2 <- ggplot(data = clean_data) +
  geom_bar(mapping = aes(x = AUR)) +
  coord_cartesian(xlim = c(0, 5)) +
  labs(title = "Overall Average User Rating",         # change title
       x = "Average User Rating")                     # change x lab

grid.arrange(G1, G2, ncol = 2)

These are the distributions for prices and ratings. One of the most important factors people consider is money: a game that is free is likely to have more downloads and users than a game with an upfront monetary cost. Unsurprisingly, when a game or app is free, the user count is massively higher than for games that require payment. This raises further questions, such as how the quality of a free game compares with that of a paid one; some might expect a paid game to naturally be “better in quality” than a free one since the cost of entry is higher. The overall average user rating shows that 4.5 is the most common rating across all price points combined.

## # A tibble: 123 x 4
## # Groups:   "Price", "Average_User_Rating" [1]
##    Price   AUR `"Price"` `"Average_User_Rating"`
##    <dbl> <dbl> <chr>     <chr>                  
##  1  5.99   4   Price     Average_User_Rating    
##  2  7.99   4   Price     Average_User_Rating    
##  3  7.99   4   Price     Average_User_Rating    
##  4  5.99   2.5 Price     Average_User_Rating    
##  5  9.99   4   Price     Average_User_Rating    
##  6  9.99   5   Price     Average_User_Rating    
##  7  7.99   4   Price     Average_User_Rating    
##  8  5.99   3   Price     Average_User_Rating    
##  9  5.99   4.5 Price     Average_User_Rating    
## 10  9.99   3.5 Price     Average_User_Rating    
## # ... with 113 more rows
## [[1]]
## # A tibble: 6,269 x 4
## # Groups:   "Price", "Average_User_Rating" [1]
##    Price   AUR `"Price"` `"Average_User_Rating"`
##    <dbl> <dbl> <chr>     <chr>                  
##  1     0   3   Price     Average_User_Rating    
##  2     0   3.5 Price     Average_User_Rating    
##  3     0   3   Price     Average_User_Rating    
##  4     0   2.5 Price     Average_User_Rating    
##  5     0   2.5 Price     Average_User_Rating    
##  6     0   2.5 Price     Average_User_Rating    
##  7     0   3.5 Price     Average_User_Rating    
##  8     0   3   Price     Average_User_Rating    
##  9     0   2.5 Price     Average_User_Rating    
## 10     0   3   Price     Average_User_Rating    
## # ... with 6,259 more rows
## 
## [[2]]
## # A tibble: 348 x 4
## # Groups:   "Price", "Average_User_Rating" [1]
##    Price   AUR `"Price"` `"Average_User_Rating"`
##    <dbl> <dbl> <chr>     <chr>                  
##  1  0.99   2.5 Price     Average_User_Rating    
##  2  0.99   3.5 Price     Average_User_Rating    
##  3  0.99   3   Price     Average_User_Rating    
##  4  0.99   2   Price     Average_User_Rating    
##  5  0.99   4   Price     Average_User_Rating    
##  6  0.99   2.5 Price     Average_User_Rating    
##  7  0.99   3.5 Price     Average_User_Rating    
##  8  0.99   3.5 Price     Average_User_Rating    
##  9  0.99   3   Price     Average_User_Rating    
## 10  0.99   3   Price     Average_User_Rating    
## # ... with 338 more rows
## 
## [[3]]
## # A tibble: 446 x 4
## # Groups:   "Price", "Average_User_Rating" [1]
##    Price   AUR `"Price"` `"Average_User_Rating"`
##    <dbl> <dbl> <chr>     <chr>                  
##  1  2.99   4   Price     Average_User_Rating    
##  2  1.99   3.5 Price     Average_User_Rating    
##  3  2.99   3.5 Price     Average_User_Rating    
##  4  2.99   4   Price     Average_User_Rating    
##  5  2.99   2.5 Price     Average_User_Rating    
##  6  2.99   4   Price     Average_User_Rating    
##  7  2.99   3.5 Price     Average_User_Rating    
##  8  1.99   4   Price     Average_User_Rating    
##  9  2.99   4   Price     Average_User_Rating    
## 10  2.99   3   Price     Average_User_Rating    
## # ... with 436 more rows
## 
## [[4]]
## # A tibble: 285 x 4
## # Groups:   "Price", "Average_User_Rating" [1]
##    Price   AUR `"Price"` `"Average_User_Rating"`
##    <dbl> <dbl> <chr>     <chr>                  
##  1  4.99   4   Price     Average_User_Rating    
##  2  4.99   4   Price     Average_User_Rating    
##  3  3.99   3   Price     Average_User_Rating    
##  4  4.99   3.5 Price     Average_User_Rating    
##  5  3.99   4.5 Price     Average_User_Rating    
##  6  4.99   4.5 Price     Average_User_Rating    
##  7  3.99   3.5 Price     Average_User_Rating    
##  8  3.99   3.5 Price     Average_User_Rating    
##  9  4.99   4   Price     Average_User_Rating    
## 10  4.99   4   Price     Average_User_Rating    
## # ... with 275 more rows
## 
## [[5]]
## # A tibble: 123 x 4
## # Groups:   "Price", "Average_User_Rating" [1]
##    Price   AUR `"Price"` `"Average_User_Rating"`
##    <dbl> <dbl> <chr>     <chr>                  
##  1  5.99   4   Price     Average_User_Rating    
##  2  7.99   4   Price     Average_User_Rating    
##  3  7.99   4   Price     Average_User_Rating    
##  4  5.99   2.5 Price     Average_User_Rating    
##  5  9.99   4   Price     Average_User_Rating    
##  6  9.99   5   Price     Average_User_Rating    
##  7  7.99   4   Price     Average_User_Rating    
##  8  5.99   3   Price     Average_User_Rating    
##  9  5.99   4.5 Price     Average_User_Rating    
## 10  9.99   3.5 Price     Average_User_Rating    
## # ... with 113 more rows
## 
## [[6]]
## # A tibble: 17 x 4
## # Groups:   "Price", "Average_User_Rating" [1]
##    Price   AUR `"Price"` `"Average_User_Rating"`
##    <dbl> <dbl> <chr>     <chr>                  
##  1  20.0   4.5 Price     Average_User_Rating    
##  2  12.0   3.5 Price     Average_User_Rating    
##  3  12.0   4.5 Price     Average_User_Rating    
##  4 140.    4.5 Price     Average_User_Rating    
##  5  20.0   4.5 Price     Average_User_Rating    
##  6  13.0   4   Price     Average_User_Rating    
##  7  20.0   3.5 Price     Average_User_Rating    
##  8  20.0   4   Price     Average_User_Rating    
##  9  20.0   4   Price     Average_User_Rating    
## 10  15.0   3.5 Price     Average_User_Rating    
## 11  13.0   4   Price     Average_User_Rating    
## 12  15.0   4   Price     Average_User_Rating    
## 13  17.0   4   Price     Average_User_Rating    
## 14  13.0   3   Price     Average_User_Rating    
## 15  12.0   4.5 Price     Average_User_Rating    
## 16  37.0   4   Price     Average_User_Rating    
## 17  60.0   4   Price     Average_User_Rating

Filtering Games by prices

We plotted the average ratings for apps at each price point, separating the charts to see whether the ratings would differ. As you can see, the free apps have a sample size far larger than all the other price points combined. This is not much of a surprise, since free games have a much lower barrier to entry, which leads to more players trying them out. More surprising is that the ratings were relatively constant across all price points. The price point with the lowest overall ratings also appears to be the most expensive games, which raises the question of whether expectations of product quality only set in once an app or game passes a certain price point.
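The price groups shown above ([[1]] free through [[6]] roughly $10 and up) can be produced with `cut()` and `split()`; the exact breakpoints here are an assumption reconstructed from the output:

```r
prices <- data.frame(Price = c(0, 0, 0.99, 2.99, 4.99, 9.99, 20))

# Bin each app into a price bracket (right-closed intervals),
# then split the data frame into a list of per-bracket data frames
bins <- cut(prices$Price,
            breaks = c(-Inf, 0, 0.99, 2.99, 4.99, 9.99, Inf),
            labels = c("free", "$0.99", "$1.99-2.99",
                       "$3.99-4.99", "$5.99-9.99", "$10+"))
groups <- split(prices, bins)

sapply(groups, nrow)
```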

——————————————————————

8. What is the average price of in-app purchases?

After taking a glance at the initial purchase price of apps, we then explored in-app purchase prices.

In order to find the average price of in-app purchases, the dataset was filtered to include only the name, price, average user rating, and in-app purchase prices. Since the in-app purchase prices column has multiple price offerings per game separated by commas, we used the separate_rows() function to split each individual in-app purchase price onto a new row and converted all values to numeric. From there, the summarise() function was implemented to find the average in-app purchase price.
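A sketch of parsing the comma-separated in-app price strings and averaging them (the `", "` separator is an assumption based on how the column prints):

```r
library(dplyr)
library(tidyr)

apps <- tibble::tibble(
  Name  = c("A", "B"),
  InApp = c("1.99, 4.99", NA)  # B has no in-app purchases
)

avg <- apps %>%
  separate_rows(InApp, sep = ", ") %>%     # one price per row
  mutate(InApp = as.numeric(InApp)) %>%    # "1.99" -> 1.99
  summarise(avgInApp = mean(InApp, na.rm = TRUE))

avg
```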

Abbreviations: AUR = Average User Rating; InApp = In-app Purchases.

## # A tibble: 1 x 3
##   avgPrice avgRating avgInApp
##      <dbl>     <dbl>    <dbl>
## 1    0.321      4.19     11.4

The average price of in-app purchases is approximately $11.40.

——————————————————————

9. Is there a relationship between user rating and in-app purchases? And does the amount of available in-app purchases decrease rating?

Then, after calculating the average in-app purchase price, we examined if a relationship existed between user ratings and in-app purchases.

To see this visually, we plotted the average user rating against the in-app purchase price to look for any sort of trend, and checked the correlation between the two variables.
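The correlation check itself is a single base-R call; `use = "complete.obs"` drops rows where either value is missing, since many apps have no rating or no in-app purchases (toy vectors below, so the value differs from the dataset’s):

```r
# Toy stand-ins for Average User Rating and mean in-app purchase price
aur   <- c(4.0, 3.5, 4.5, 3.0, NA)
inapp <- c(1.99, 4.99, 0.99, 9.99, 2.99)

r <- cor(aur, inapp, use = "complete.obs")  # ignore the NA pair
r
```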

## [1] -0.01262201

The negative correlation between Average User Rating and In-App Purchases might suggest that as the cost of in-app purchases increases, the average user rating decreases.

However, with a correlation coefficient of -0.01, so close to 0, any relationship between In-App Purchases and Average User Rating is extremely weak: the sign is negative, but the relationship is practically negligible.

——————————————————————

10. What information can we find about game developers and their strategy games?

## Selecting by User Rating Count
## # A tibble: 5 x 4
##   Name              Developer               `Average User Rat~ `User Rating Cou~
##   <chr>             <chr>                                <dbl>             <dbl>
## 1 "Clash of Clans"  Supercell                              4.5           3032734
## 2 "Clash Royale"    Supercell                              4.5           1277095
## 3 "PUBG MOBILE"     Tencent Mobile Interna~                4.5            711409
## 4 "Plants vs. Zomb~ PopCap                                 4.5            469562
## 5 "Pok\\xe9mon GO"  Niantic, Inc.                          3.5            439776

For this graph, the popularity of a game is measured by its user rating count rather than its Average User Rating. Average User Rating alone is not a good measure of popularity because a game can have a very high rating from a very small number of ratings. Looking at the graph, a surprising finding was that the two most popular games were created by the same developer, Supercell. Four of the five games shown also share a high average user rating of 4.5.
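The ranking logic above (the report uses dplyr's top_n() on User Rating Count) can be sketched in Python; the "Tiny Indie Game" entry is invented to show why rating count, not average rating, drives the ranking:

```python
# Hypothetical (name, developer, avg_rating, rating_count) records.
games = [
    ("Clash of Clans", "Supercell", 4.5, 3032734),
    ("Tiny Indie Game", "Solo Dev", 5.0, 12),       # high rating, tiny sample
    ("Clash Royale", "Supercell", 4.5, 1277095),
    ("PUBG MOBILE", "Tencent", 4.5, 711409),
]

# Rank by rating *count*: a 5.0 from 12 users says less about popularity
# than a 4.5 from 3 million users.
top = sorted(games, key=lambda g: g[3], reverse=True)[:3]
print([g[0] for g in top])  # -> ['Clash of Clans', 'Clash Royale', 'PUBG MOBILE']
```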

——————————————————————

11. What is the frequency of the age groups?

Looking at the graph, a large majority of the games carry an age rating of 4+, presumably because a lower age rating lets a game reach, and attract, a wider pool of users.

——————————————————————

12. How has the size of the applications of the top 3 primary genres changed over a span of about 11 years?

Cleaning Original Data

## # A tibble: 6 x 15
##   Name    `Icon URL`    `Average User R~ `User Rating Co~ Price `In-app Purchas~
##   <chr>   <chr>                    <dbl>            <dbl> <dbl> <chr>           
## 1 Sudoku  https://is2-~              4               3553  2.99 <NA>            
## 2 Reversi https://is4-~              3.5              284  1.99 <NA>            
## 3 Morocco https://is5-~              3               8376  0    <NA>            
## 4 Sudoku~ https://is3-~              3.5           190394  0    <NA>            
## 5 Senet ~ https://is1-~              3.5               28  2.99 <NA>            
## 6 Sudoku~ https://is1-~              3                 47  0    1.99            
## # ... with 9 more variables: Description <chr>, Developer <chr>,
## #   Age Rating <chr>, Languages <chr>, Size <dbl>, Primary Genre <chr>,
## #   Genres <chr>, Original Release Date <date>,
## #   Current Version Release Date <date>
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.

Next we asked how the size of the applications of the top 3 primary genres changed over a span of about 11 years. As the plot shows, application sizes increased substantially, now averaging about 3 × 10^8 bytes (roughly 300 MB).

This makes sense: as applications become more complex, with more lines of code, more features, higher-resolution images, and 3D models with more polygons, their size grows accordingly.
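The grouping behind this plot (group_by release date and genre, then summarise mean Size in R) can be sketched in Python; the dates and byte counts here are invented for illustration:

```python
from collections import defaultdict
from datetime import date

# Hypothetical (release_date, primary_genre, size_bytes) records.
apps = [
    (date(2008, 7, 11), "Games", 15_853_568),
    (date(2008, 7, 23), "Games", 12_000_000),
    (date(2019, 4, 1), "Games", 310_000_000),
    (date(2019, 6, 1), "Games", 290_000_000),
]

# Equivalent of group_by(year, genre) %>% summarise(mean(Size)).
totals = defaultdict(lambda: [0, 0])  # (year, genre) -> [sum, count]
for released, genre, size in apps:
    key = (released.year, genre)
    totals[key][0] += size
    totals[key][1] += 1

avg_size = {key: total / n for key, (total, n) in totals.items()}
print(avg_size[(2019, "Games")])  # -> 300000000.0
```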

——————————————————————

Data analysis, modeling, and/or predictions

13. What contributes to a game’s success?

Linear Regression Analysis (Measuring by Average User Rating )

There are many ways to measure the success of a game. With the dataset we have, we decided that Average User Rating would be a reasonable way to measure that success.

We chose these four variables as our predictors since they seem to be important factors in a game's rating. Age Rating helps focus a game on a specific age group, which might give it a better chance of a good rating; expectations of a game may differ by age rating, and some of those expectations might be easier to satisfy than others. Price sets an expectation of how good the game should be, since users make an initial "investment" before actually playing it. Size could make a game more appealing, as a larger game may have more features and be more refined than games that are much smaller. User Rating Count shows how active the game is and tells us the sample size behind each rating; a bigger sample size is better since it reinforces whether the game is entertaining for a large group of people.

Null hypothesis: H0: β1 = β2 = · · · = βp = 0 — there is no relationship between X1, X2, · · · , Xp and Y at all.

Alternative hypothesis: Ha: at least one βj ≠ 0 — there is some relationship between at least one Xj and Y.

This is our hypothesis test for whether the predictors have a relationship with our response Y, which in this case is Average User Rating. To test it, we compute the p-values for the predictors in relation to Y.
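As a sanity check on the overall test, the F-statistic that summary(lm(...)) reports can be reproduced from the model's R² alone via F = (R²/p) / ((1 − R²)/(n − p − 1)). A short Python sketch, plugging in the values from the regression summary in this section:

```python
# Reconstructing the overall F-statistic from R-squared:
#   F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
r_squared = 0.006409   # Multiple R-squared from the summary output
p = 6                  # model degrees of freedom (dummy-coded predictors)
resid_df = 7481        # residual degrees of freedom

f_stat = (r_squared / p) / ((1 - r_squared) / resid_df)
print(round(f_stat, 2))  # -> 8.04, matching the reported F-statistic
```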

## 
## Call:
## lm(formula = `Average User Rating` ~ Price + `User Rating Count` + 
##     `Age Rating` + Size, data = clean_games)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1101 -0.5272  0.2995  0.4624  1.0804 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4.073e+00  2.295e-02 177.502  < 2e-16 ***
## Price               -3.729e-03  3.623e-03  -1.029  0.30336    
## `User Rating Count`  5.267e-07  2.038e-07   2.584  0.00978 ** 
## `Age Rating`17+     -1.518e-01  4.884e-02  -3.108  0.00189 ** 
## `Age Rating`4+      -4.668e-02  2.439e-02  -1.914  0.05568 .  
## `Age Rating`9+      -1.075e-02  2.863e-02  -0.375  0.70735    
## Size                 1.650e-10  3.572e-11   4.619 3.92e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7484 on 7481 degrees of freedom
## Multiple R-squared:  0.006409,   Adjusted R-squared:  0.005612 
## F-statistic: 8.042 on 6 and 7481 DF,  p-value: 1.123e-08
##       value       numdf       dendf 
##    8.042076    6.000000 7481.000000

We ran a multiple linear regression and obtained some important information. First, the overall p-value is very low, so we can safely reject our null hypothesis in favor of the alternative: there is some relationship between Y and the predictors. Also, the F-statistic is well above 1, which likewise points to a relationship between the predictors and Y. However, the adjusted R² is very low, indicating that the predictors explain very little of the variation in the dependent variable. The residual standard error (RSE) measures lack of fit; our RSE of about 0.75 is modest on a 0–5 rating scale, but given the low adjusted R², the model's fit should not be overstated. Lastly, to pick the best predictor of Y: looking at the p-values and absolute t-values, Size is the best predictor of Average User Rating, since it has by far the lowest p-value and the highest absolute t-value among the predictors.
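The adjusted R² quoted above can likewise be reproduced by hand from the plain R², since it simply penalizes R² for the number of predictors. A Python sketch using the figures from the summary output:

```python
# Adjusted R-squared penalizes R-squared for model complexity:
#   adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r_squared = 0.006409
n = 7488               # observations = residual df (7481) + p (6) + 1
p = 6

adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 6))  # approx. 0.005612, as reported
```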

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Plotting the predictors, we see that Age Rating, being categorical, does not show a meaningful regression line, while Price has an essentially flat one. Both Size and User Rating Count, however, show a positive linear trend.

——————————————————————

14. Can we predict if an app is free or not?

For the next model, we wanted to predict whether an application is free using multiple logistic regression.

To start, we had to do some initial cleaning. The in-app purchases column contained a single string of all the purchase prices an app offers. We separated the rows and converted the values to doubles. Once they were doubles, we could summarise to find the sum of each app's in-app purchases (sum.iap), their count (count.iap), and their average (avg.iap). We also created classes to group the apps' average in-app purchase prices, which would prove useful in later models.
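The per-game summarise-and-classify step can be sketched in Python (the analysis itself is R); the class boundary above $10 is an assumption for illustration, as the report only shows the "$0" and "$0.01-$10.00" labels. Treating a missing price list as a single $0 entry matches the count.iap of 1 shown for such games:

```python
def summarise_iap(price_list):
    """Per-game sum.iap, count.iap, avg.iap from a comma-separated price
    string, plus a class label (bins above $10 are assumed, hypothetical)."""
    prices = [float(p) for p in price_list.split(",")] if price_list else [0.0]
    total, count = sum(prices), len(prices)
    avg = total / count
    if avg == 0:
        cls = "$0"
    elif avg <= 10:
        cls = "$0.01-$10.00"
    else:
        cls = "over $10.00"  # assumed label, not from the report
    return total, count, avg, cls

# Matches the '"100 Years' War"' row: sum 3.98, count 2, avg 1.99.
print(summarise_iap("1.99, 1.99"))
```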

## # A tibble: 6 x 14
##   Name   `Average User Ra~ `User Rating Co~ Price `In-app Purchas~ Description  
##   <chr>              <dbl>            <dbl> <dbl> <chr>            <chr>        
## 1 Sudoku               4               3553  2.99 <NA>             "Join over 2~
## 2 Rever~               3.5              284  1.99 <NA>             "The classic~
## 3 Moroc~               3               8376  0    <NA>             "Play the cl~
## 4 Sudok~               3.5           190394  0    <NA>             "Top 100 fre~
## 5 Senet~               3.5               28  2.99 <NA>             "\"Senet Del~
## 6 Sudok~               3                 47  0    1.99             "Sudoku will~
## # ... with 8 more variables: Developer <chr>, Age Rating <chr>,
## #   Languages <chr>, Size <dbl>, Primary Genre <chr>, Genres <chr>,
## #   Original Release Date <date>, Current Version Release Date <date>
## # A tibble: 7,488 x 4
##    Name                                              sum.iap count.iap avg.iap
##    <chr>                                               <dbl>     <int>   <dbl>
##  1 "Bungee Stickmen - Australian Landmarks {LITE +}"    239.         3    79.7
##  2 "Arcane Pets: Plushie Empire"                        300.         4    75.0
##  3 "War of Nations\\u2122 - PVP Strategy"               675.        10    67.5
##  4 "War Planet Online"                                  665.        10    66.5
##  5 "Final Fantasy XV: A New Empire"                     655.        10    65.5
##  6 "My Math Elementary Kids Games"                      174.         3    58.0
##  7 "Imperial Ambition"                                  551.        10    55.1
##  8 "Idle Crypto Tycoon"                                 103.         2    51.5
##  9 "World War Rising"                                   515.        10    51.5
## 10 "Clash of Queens: Light or Dark"                     411.         8    51.4
## # ... with 7,478 more rows
## Warning: Unknown or uninitialised column: `iap.class`.
## # A tibble: 6 x 5
##   Name                                 sum.iap count.iap avg.iap iap.class   
##   <chr>                                  <dbl>     <int>   <dbl> <chr>       
## 1 "- Turning -"                           3.98         2    1.99 $0.01-$10.00
## 2 "! Chess !"                             0            1    0    $0          
## 3 "\"100 Years' War\""                    3.98         2    1.99 $0.01-$10.00
## 4 "\"3D Rubik's Cube : Rubik Solver\""    0            1    0    $0          
## 5 "\"3x3 Rubik's Cube Solver\""           0            1    0    $0          
## 6 "\"9 Men's Morris\""                    0.99         1    0.99 $0.01-$10.00

Next, we were given date columns such as the day the app was released and the day it was last updated, but we can't use raw dates in a model. So we computed the total number of days since release and the days since the last update by subtracting each date from the date the data was scraped (2019-08-03).

And because we wanted to predict whether an application is free, we used an if-else statement to assign a 1 if the app was free and a 0 if it wasn't.
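Both derived features can be sketched in one small Python function (the report does this in R); the function name is our own, and the sample values are the Sudoku row from the table below:

```python
from datetime import date

SCRAPE_DATE = date(2019, 8, 3)  # the day the dataset was scraped

def derive_features(price, released, last_updated):
    """Days-since features and the binary 'free' label used by the models."""
    days_since_release = (SCRAPE_DATE - released).days
    days_since_last_update = (SCRAPE_DATE - last_updated).days
    free = 1 if price == 0 else 0
    return days_since_release, days_since_last_update, free

# Sudoku: $2.99, released 2008-07-11, last updated 2017-05-30.
print(derive_features(2.99, date(2008, 7, 11), date(2017, 5, 30)))  # -> (4040, 795, 0)
```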

## # A tibble: 6 x 6
##   `Original Release ~ `Current Version Rel~ sum.iap count.iap avg.iap iap.class 
##   <date>              <date>                  <dbl>     <int>   <dbl> <chr>     
## 1 2008-07-11          2017-05-30               0            1    0    $0        
## 2 2008-07-11          2018-05-17               0            1    0    $0        
## 3 2008-07-11          2017-09-05               0            1    0    $0        
## 4 2008-07-23          2017-05-30               0            1    0    $0        
## 5 2008-07-18          2018-07-22               0            1    0    $0        
## 6 2008-07-30          2019-04-29               1.99         1    1.99 $0.01-$10~
## # A tibble: 7,464 x 6
##    Name              Price Today      days.since.relea~ days.since.last.u~  free
##    <chr>             <dbl> <date>                 <dbl>              <dbl> <dbl>
##  1 "Sudoku"           2.99 2019-08-03              4040                795     0
##  2 "Reversi"          1.99 2019-08-03              4040                443     0
##  3 "Morocco"          0    2019-08-03              4040                697     1
##  4 "Sudoku (Free)"    0    2019-08-03              4028                795     1
##  5 "Senet Deluxe"     2.99 2019-08-03              4033                377     0
##  6 "Sudoku - Classi~  0    2019-08-03              4021                 96     1
##  7 "Colony"           0.99 2019-08-03              4017                304     0
##  8 "Carte"            0    2019-08-03              4017                618     1
##  9 "\"Barrels O' Fu~  0    2019-08-03              4019               4019     1
## 10 "Lumen Lite"       0    2019-08-03              4002               3906     1
## # ... with 7,454 more rows

The last of the cleaning before we start modeling is to remove unnecessary variables and to separate the language and genre variables by their delimiters, as shown in the before and after.
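Separating both delimited columns produces one row per (language, genre) pair, which is why the row count jumps from 7,488 games to 101,229 rows. A Python sketch of the two chained separate_rows() calls for a single game:

```python
from itertools import product

# One source row with two comma-delimited columns, as in the "before" table.
row = {
    "Name": "Sudoku",
    "Languages": "DA, NL, EN",
    "Genres": "Games, Strategy, Puzzle",
}

languages = [x.strip() for x in row["Languages"].split(",")]
genres = [x.strip() for x in row["Genres"].split(",")]

# Chaining separate_rows() on both columns yields their cross product.
exploded = [(row["Name"], lang, genre) for lang, genre in product(languages, genres)]
print(len(exploded))  # -> 9: 3 languages x 3 genres from one game
```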

## # A tibble: 7,488 x 3
##    Name               Languages                            Genres               
##    <chr>              <chr>                                <chr>                
##  1 "Sudoku"           DA, NL, EN, FI, FR, DE, IT, JA, KO,~ Games, Strategy, Puz~
##  2 "Reversi"          EN                                   Games, Strategy, Boa~
##  3 "Morocco"          EN                                   Games, Board, Strate~
##  4 "Sudoku (Free)"    DA, NL, EN, FI, FR, DE, IT, JA, KO,~ Games, Strategy, Puz~
##  5 "Senet Deluxe"     DA, NL, EN, FR, DE, EL, IT, JA, KO,~ Games, Strategy, Boa~
##  6 "Sudoku - Classic~ EN                                   Games, Entertainment~
##  7 "Gravitation"      <NA>                                 Games, Entertainment~
##  8 "Colony"           EN                                   Games, Strategy, Boa~
##  9 "Carte"            FR                                   Games, Strategy, Boa~
## 10 "\"Barrels O' Fun~ EN                                   Games, Casual, Strat~
## # ... with 7,478 more rows
## # A tibble: 101,229 x 3
##    Name   Languages Genres  
##    <chr>  <chr>     <chr>   
##  1 Sudoku DA        Games   
##  2 Sudoku DA        Strategy
##  3 Sudoku DA        Puzzle  
##  4 Sudoku NL        Games   
##  5 Sudoku NL        Strategy
##  6 Sudoku NL        Puzzle  
##  7 Sudoku EN        Games   
##  8 Sudoku EN        Strategy
##  9 Sudoku EN        Puzzle  
## 10 Sudoku FI        Games   
## # ... with 101,219 more rows
## # A tibble: 101,229 x 13
##    `Average User Rating` `User Rating Count` `Age Rating` Languages     Size
##                    <dbl>               <dbl> <chr>        <chr>        <dbl>
##  1                     4                3553 4+           DA        15853568
##  2                     4                3553 4+           DA        15853568
##  3                     4                3553 4+           DA        15853568
##  4                     4                3553 4+           NL        15853568
##  5                     4                3553 4+           NL        15853568
##  6                     4                3553 4+           NL        15853568
##  7                     4                3553 4+           EN        15853568
##  8                     4                3553 4+           EN        15853568
##  9                     4                3553 4+           EN        15853568
## 10                     4                3553 4+           FI        15853568
## # ... with 101,219 more rows, and 8 more variables: Primary Genre <chr>,
## #   Genres <chr>, sum.iap <dbl>, count.iap <int>, iap.class <chr>,
## #   days.since.release <dbl>, days.since.last.update <dbl>, free <dbl>

The first model we created was a full base model, meaning we used only the information originally given, with none of the new variables we made. The base predictors were average user rating, user rating count, age rating, languages, primary genre, sub-genres, and the size of the app. Because we are using logistic regression, we used the glm function with family = binomial to predict whether the app was free.

We also created a function to compute the misclassification error, to avoid repeating that code for each model.
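The misclassification-error helper can be sketched in Python (the report's version is an R function); it assumes the usual 0.5 cutoff on the model's fitted probabilities, and the probabilities in the example are invented:

```python
def misclassification_error(probs, actual, threshold=0.5):
    """Share of predictions that miss: predict class 1 when the model's
    fitted probability exceeds the threshold, then compare to the labels."""
    predicted = [1 if p > threshold else 0 for p in probs]
    wrong = sum(pred != act for pred, act in zip(predicted, actual))
    return wrong / len(actual)

# Hypothetical fitted probabilities against true free (1) / paid (0) labels.
print(misclassification_error([0.9, 0.8, 0.3, 0.6], [1, 0, 0, 1]))  # -> 0.25
```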

Multiple Logistic Regression of Original Data Given (No Edits)

Creating a function to find mce

## # A tibble: 101,229 x 8
##    `Average User Rating` `User Rating Count` `Age Rating` Languages     Size
##                    <dbl>               <dbl> <chr>        <chr>        <dbl>
##  1                     4                3553 4+           DA        15853568
##  2                     4                3553 4+           DA        15853568
##  3                     4                3553 4+           DA        15853568
##  4                     4                3553 4+           NL        15853568
##  5                     4                3553 4+           NL        15853568
##  6                     4                3553 4+           NL        15853568
##  7                     4                3553 4+           EN        15853568
##  8                     4                3553 4+           EN        15853568
##  9                     4                3553 4+           EN        15853568
## 10                     4                3553 4+           FI        15853568
## # ... with 101,219 more rows, and 3 more variables: Primary Genre <chr>,
## #   Genres <chr>, free <dbl>
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = free ~ ., family = binomial(), data = logit.data.orig)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2428   0.3306   0.5118   0.5927   2.1978  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       1.645e+01  3.603e+02   0.046 0.963577    
## `Average User Rating`             1.006e-01  1.326e-02   7.588 3.24e-14 ***
## `User Rating Count`               1.354e-05  9.251e-07  14.631  < 2e-16 ***
## `Age Rating`17+                   3.639e-01  6.694e-02   5.437 5.43e-08 ***
## `Age Rating`4+                   -2.647e-01  2.750e-02  -9.624  < 2e-16 ***
## `Age Rating`9+                   -4.281e-01  2.961e-02 -14.457  < 2e-16 ***
## LanguagesAM                       1.503e+01  9.037e+02   0.017 0.986727    
## LanguagesAR                       1.599e+00  6.574e-01   2.432 0.015035 *  
## LanguagesAS                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesAY                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesAZ                       1.460e+01  6.918e+02   0.021 0.983161    
## LanguagesBE                       1.468e+01  5.975e+02   0.025 0.980403    
## LanguagesBG                       2.117e+00  8.163e-01   2.594 0.009499 ** 
## LanguagesBN                       1.492e+01  2.135e+02   0.070 0.944299    
## LanguagesBO                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesBR                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesBS                       1.459e+01  5.405e+02   0.027 0.978456    
## LanguagesCA                       1.179e+00  6.596e-01   1.788 0.073781 .  
## LanguagesCS                       9.104e-01  6.478e-01   1.405 0.159887    
## LanguagesCY                       1.507e+01  8.991e+02   0.017 0.986632    
## LanguagesDA                       6.123e-01  6.458e-01   0.948 0.343076    
## LanguagesDE                       5.381e-03  6.404e-01   0.008 0.993296    
## LanguagesDZ                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesEL                       9.919e-01  6.499e-01   1.526 0.126944    
## LanguagesEN                       1.708e-01  6.397e-01   0.267 0.789466    
## LanguagesEO                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesES                       1.004e-01  6.405e-01   0.157 0.875451    
## LanguagesET                       1.459e+01  4.283e+02   0.034 0.972816    
## LanguagesEU                       1.460e+01  6.918e+02   0.021 0.983161    
## LanguagesFA                       1.124e+00  7.063e-01   1.592 0.111482    
## LanguagesFI                       6.498e-01  6.472e-01   1.004 0.315407    
## LanguagesFO                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesFR                       1.507e-02  6.404e-01   0.024 0.981220    
## LanguagesGA                       1.384e+01  6.201e+02   0.022 0.982188    
## LanguagesGD                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesGL                       1.460e+01  6.918e+02   0.021 0.983161    
## LanguagesGN                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesGU                       1.495e+01  2.341e+02   0.064 0.949070    
## LanguagesGV                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesHE                       9.322e-01  6.523e-01   1.429 0.152968    
## LanguagesHI                       1.351e+00  6.842e-01   1.975 0.048253 *  
## LanguagesHR                       1.589e+00  7.102e-01   2.237 0.025272 *  
## LanguagesHU                       1.006e+00  6.528e-01   1.541 0.123335    
## LanguagesHY                       1.500e+01  3.408e+02   0.044 0.964892    
## LanguagesID                       1.380e+00  6.506e-01   2.121 0.033901 *  
## LanguagesIS                       1.482e+01  5.969e+02   0.025 0.980189    
## LanguagesIT                       2.075e-01  6.409e-01   0.324 0.746090    
## LanguagesIU                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesJA                       2.869e-01  6.407e-01   0.448 0.654329    
## LanguagesJV                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesKA                       1.460e+01  6.918e+02   0.021 0.983161    
## LanguagesKK                       1.434e+01  8.010e+02   0.018 0.985716    
## LanguagesKL                       1.518e+01  8.363e+02   0.018 0.985515    
## LanguagesKM                       1.493e+01  7.552e+02   0.020 0.984231    
## LanguagesKN                       1.492e+01  2.256e+02   0.066 0.947279    
## LanguagesKO                       2.698e-01  6.410e-01   0.421 0.673797    
## LanguagesKR                       1.520e+01  8.441e+02   0.018 0.985629    
## LanguagesKS                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesKU                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesKY                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesLA                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesLO                       1.493e+01  7.552e+02   0.020 0.984231    
## LanguagesLT                       8.708e-01  8.797e-01   0.990 0.322234    
## LanguagesLV                       1.491e+01  2.243e+02   0.066 0.946991    
## LanguagesMG                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesMK                       1.473e+01  5.078e+02   0.029 0.976857    
## LanguagesML                       1.493e+01  2.448e+02   0.061 0.951375    
## LanguagesMN                       1.503e+01  9.037e+02   0.017 0.986727    
## LanguagesMR                       1.496e+01  2.308e+02   0.065 0.948329    
## LanguagesMS                       1.308e+00  6.563e-01   1.993 0.046211 *  
## LanguagesMT                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesMY                       1.451e+01  6.082e+02   0.024 0.980968    
## LanguagesNB                       6.055e-01  6.467e-01   0.936 0.349101    
## LanguagesNE                       1.460e+01  6.918e+02   0.021 0.983161    
## LanguagesNL                       3.977e-01  6.425e-01   0.619 0.535925    
## LanguagesNN                       5.536e-01  8.289e-01   0.668 0.504231    
## LanguagesNO                       2.245e-01  6.849e-01   0.328 0.743026    
## LanguagesOM                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesOR                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesPA                       1.494e+01  2.498e+02   0.060 0.952314    
## LanguagesPL                       3.334e-01  6.427e-01   0.519 0.603903    
## LanguagesPS                       1.434e+01  8.010e+02   0.018 0.985716    
## LanguagesPT                       3.269e-01  6.411e-01   0.510 0.610111    
## LanguagesQU                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesRN                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesRO                       1.760e+00  6.679e-01   2.636 0.008398 ** 
## LanguagesRU                       2.485e-01  6.408e-01   0.388 0.698193    
## LanguagesRW                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesSA                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesSD                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesSE                       7.954e-02  8.434e-01   0.094 0.924859    
## LanguagesSI                       1.474e+01  6.183e+02   0.024 0.980983    
## LanguagesSK                       9.631e-01  6.538e-01   1.473 0.140702    
## LanguagesSL                       1.887e+00  8.174e-01   2.309 0.020963 *  
## LanguagesSO                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesSQ                       1.481e+01  4.669e+02   0.032 0.974685    
## LanguagesSR                       9.674e-01  8.244e-01   1.173 0.240613    
## LanguagesSU                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesSV                       3.141e-01  6.431e-01   0.488 0.625204    
## LanguagesSW                       1.460e+01  6.918e+02   0.021 0.983161    
## LanguagesTA                       1.492e+01  2.256e+02   0.066 0.947279    
## LanguagesTE                       1.489e+01  2.387e+02   0.062 0.950264    
## LanguagesTG                       1.434e+01  8.010e+02   0.018 0.985716    
## LanguagesTH                       1.073e+00  6.473e-01   1.658 0.097239 .  
## LanguagesTI                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesTK                       1.434e+01  8.010e+02   0.018 0.985716    
## LanguagesTL                       1.519e+01  4.676e+02   0.032 0.974082    
## LanguagesTO                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesTR                       7.150e-01  6.431e-01   1.112 0.266236    
## LanguagesTT                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesUG                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesUK                       1.363e+00  6.624e-01   2.057 0.039692 *  
## LanguagesUR                       1.480e+01  3.942e+02   0.038 0.970056    
## LanguagesUZ                       1.434e+01  8.010e+02   0.018 0.985716    
## LanguagesVI                       1.248e+00  6.510e-01   1.916 0.055333 .  
## LanguagesYI                       1.496e+01  1.196e+03   0.013 0.990017    
## LanguagesZH                       3.911e-01  6.406e-01   0.610 0.541578    
## LanguagesZU                       1.513e+01  1.385e+03   0.011 0.991280    
## Size                             -8.699e-10  2.910e-11 -29.891  < 2e-16 ***
## `Primary Genre`Business          -1.854e+01  3.603e+02  -0.051 0.958963    
## `Primary Genre`Education         -1.652e+01  3.603e+02  -0.046 0.963418    
## `Primary Genre`Entertainment     -1.358e+01  3.603e+02  -0.038 0.969933    
## `Primary Genre`Finance           -1.410e+01  3.603e+02  -0.039 0.968781    
## `Primary Genre`Food & Drink      -4.442e-01  1.092e+03   0.000 0.999675    
## `Primary Genre`Games             -1.480e+01  3.603e+02  -0.041 0.967231    
## `Primary Genre`Health & Fitness  -1.709e+01  3.603e+02  -0.047 0.962156    
## `Primary Genre`Lifestyle         -1.728e+01  3.603e+02  -0.048 0.961743    
## `Primary Genre`Medical            1.065e+00  5.470e+02   0.002 0.998447    
## `Primary Genre`Music             -9.439e-01  4.132e+02  -0.002 0.998177    
## `Primary Genre`Navigation        -7.060e-01  1.172e+03  -0.001 0.999519    
## `Primary Genre`News              -2.954e-01  9.208e+02   0.000 0.999744    
## `Primary Genre`Productivity      -1.835e+01  3.603e+02  -0.051 0.959381    
## `Primary Genre`Reference         -1.538e+01  3.603e+02  -0.043 0.965956    
## `Primary Genre`Shopping           9.306e-02  1.431e+03   0.000 0.999948    
## `Primary Genre`Social Networking -9.199e-01  4.626e+02  -0.002 0.998413    
## `Primary Genre`Sports            -1.429e+01  3.603e+02  -0.040 0.968367    
## `Primary Genre`Stickers          -1.590e+01  3.603e+02  -0.044 0.964792    
## `Primary Genre`Travel            -9.081e-02  1.252e+03   0.000 0.999942    
## `Primary Genre`Utilities         -1.614e+01  3.603e+02  -0.045 0.964259    
## GenresAdventure                  -2.582e-01  9.506e-02  -2.716 0.006603 ** 
## GenresBoard                      -1.018e+00  6.734e-02 -15.110  < 2e-16 ***
## GenresBooks                      -2.643e-01  6.422e-01  -0.412 0.680648    
## GenresBusiness                   -4.916e-01  3.436e-01  -1.431 0.152423    
## GenresCard                       -1.975e-01  1.007e-01  -1.962 0.049765 *  
## GenresCasino                      1.458e+00  7.191e-01   2.028 0.042591 *  
## GenresCasual                      4.888e-01  1.143e-01   4.277 1.89e-05 ***
## GenresDrink                       1.558e+00  7.208e-01   2.161 0.030694 *  
## GenresEducation                  -9.034e-01  1.020e-01  -8.858  < 2e-16 ***
## GenresEmoji                      -1.013e+00  1.579e+00  -0.641 0.521225    
## GenresEntertainment              -1.628e-01  5.835e-02  -2.789 0.005279 ** 
## GenresExpressions                -1.013e+00  1.579e+00  -0.641 0.521225    
## GenresFamily                      5.840e-01  1.285e-01   4.546 5.46e-06 ***
## GenresFinance                     1.100e+00  7.300e-01   1.507 0.131758    
## GenresFitness                     1.757e+00  1.086e+00   1.618 0.105558    
## GenresFood                        1.558e+00  7.208e-01   2.161 0.030694 *  
## GenresGames                      -2.384e-01  5.486e-02  -4.345 1.39e-05 ***
## GenresGaming                     -2.126e-01  1.413e+00  -0.150 0.880401    
## GenresHealth                      1.757e+00  1.086e+00   1.618 0.105558    
## GenresKids                        1.586e+01  2.400e+03   0.007 0.994726    
## GenresLifestyle                   1.025e+00  2.653e-01   3.864 0.000112 ***
## GenresMagazines                   1.454e+01  2.400e+03   0.006 0.995166    
## GenresMedical                    -2.786e+00  1.226e+00  -2.272 0.023089 *  
## GenresMusic                       1.704e+00  5.088e-01   3.349 0.000810 ***
## GenresNavigation                  1.408e+01  9.375e+02   0.015 0.988018    
## GenresNetworking                  1.267e+00  3.266e-01   3.879 0.000105 ***
## GenresNews                       -1.993e-01  1.084e+00  -0.184 0.854106    
## GenresNewspapers                  1.454e+01  2.400e+03   0.006 0.995166    
## GenresPhoto                       1.470e+01  7.944e+02   0.019 0.985239    
## GenresPlaying                     1.615e-01  7.567e-02   2.134 0.032827 *  
## GenresProductivity                2.052e-01  4.647e-01   0.442 0.658730    
## GenresPuzzle                     -3.900e-01  6.786e-02  -5.747 9.11e-09 ***
## GenresRacing                      5.463e-01  3.559e-01   1.535 0.124766    
## GenresReference                  -9.045e-01  2.845e-01  -3.180 0.001473 ** 
## GenresRole                        1.615e-01  7.567e-02   2.134 0.032827 *  
## GenresShopping                   -2.904e-01  2.770e+03   0.000 0.999916    
## GenresSimulation                 -4.561e-01  6.540e-02  -6.974 3.08e-12 ***
## GenresSocial                      1.267e+00  3.266e-01   3.879 0.000105 ***
## GenresSports                      8.742e-01  1.632e-01   5.358 8.40e-08 ***
## GenresStickers                   -2.126e-01  1.413e+00  -0.150 0.880401    
## GenresStrategy                   -2.384e-01  5.486e-02  -4.346 1.39e-05 ***
## GenresTravel                     -2.837e-01  4.499e-01  -0.631 0.528264    
## GenresTrivia                      1.402e+00  3.329e-01   4.210 2.56e-05 ***
## GenresUtilities                   2.995e-03  2.130e-01   0.014 0.988778    
## GenresVideo                       1.470e+01  7.944e+02   0.019 0.985239    
## GenresWeather                    -1.839e+01  2.400e+03  -0.008 0.993886    
## GenresWord                        4.724e-01  4.710e-01   1.003 0.315914    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 81752  on 101228  degrees of freedom
## Residual deviance: 76180  on 101043  degrees of freedom
## AIC: 76552
## 
## Number of Fisher Scoring iterations: 15
## [1] 0.1372433

Looking at the coefficients of the base full width model, the output is unwieldy: there are many insignificant coefficients, with only 11 of the 113 languages significant, none of the primary genres significant, and only about half of the sub genres significant. However, for a base model, a misclassification error (mce) of 0.137 is not bad.
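The mce reported after each model summary can be computed by thresholding the fitted probabilities at 0.5 and comparing against the observed `free` indicator. A minimal sketch, using a small toy data frame in place of the actual model object and data:

```r
# Toy data standing in for the real dataset; `free` is the binary response.
dat <- data.frame(free = c(1, 0, 1, 1, 0, 0, 1, 0),
                  x    = c(2, 1, 3, 1, 0, 2, 4, 1))
fit <- glm(free ~ x, family = binomial(), data = dat)

# Classify as free (1) when the fitted probability exceeds 0.5, then take
# the proportion of misclassified rows.
pred <- ifelse(predict(fit, type = "response") > 0.5, 1, 0)
mce  <- mean(pred != dat$free)
```

The same threshold-and-average step applies to each of the three models whose mce values are reported in this section.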

Next, we added the cleaned predictors to our original model and compared the misclassification errors. The added predictors were: the sum of in-app purchase prices (sum.iap), the number of in-app purchases (count.iap), a binned purchase-price class (iap.class), days since release, and days since last update.
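A sketch of how these cleaned predictors can be derived, assuming dplyr is available. The column names follow the "17K Mobile Strategy Games" file, the bin edges follow the iap.class levels visible in the coefficient table below, and a fixed reference date stands in for the scrape date; all of these are assumptions, not the project's exact code:

```r
library(dplyr)

# Toy rows standing in for the raw dataset.
games <- tibble::tibble(
  `In-app Purchases`             = c("0.99, 4.99", "9.99"),
  `Original Release Date`        = as.Date(c("2015-03-01", "2019-07-15")),
  `Current Version Release Date` = as.Date(c("2019-06-01", "2019-08-01"))
)

ref.date <- as.Date("2019-08-03")  # assumed scrape date

logit.data.clean <- games %>%
  mutate(
    # "0.99, 4.99" -> c(0.99, 4.99); sum and count the listed prices.
    sum.iap   = sapply(strsplit(`In-app Purchases`, ", "),
                       function(x) sum(as.numeric(x))),
    count.iap = lengths(strsplit(`In-app Purchases`, ", ")),
    # Bin the total purchase price into the classes used by the model.
    iap.class = cut(sum.iap, breaks = c(0, 10, 20, 30, 40, 80),
                    labels = c("$0.01-$10.00", "$10.01-$20.00",
                               "$20.01-$30.00", "$30.01-$40.00",
                               "$40.01-$80.00")),
    days.since.release     = as.numeric(ref.date - `Original Release Date`),
    days.since.last.update = as.numeric(ref.date - `Current Version Release Date`)
  )
```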

Multiple Logistic Regression with All New Columns

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = free ~ ., family = binomial(), data = logit.data.clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5708   0.0947   0.2637   0.4962   3.1204  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       1.769e+01  5.658e+02   0.031 0.975055    
## `Average User Rating`            -1.606e-01  1.562e-02 -10.281  < 2e-16 ***
## `User Rating Count`               1.239e-05  9.048e-07  13.699  < 2e-16 ***
## `Age Rating`17+                   8.972e-01  7.647e-02  11.734  < 2e-16 ***
## `Age Rating`4+                    6.009e-01  3.342e-02  17.980  < 2e-16 ***
## `Age Rating`9+                   -6.091e-01  3.593e-02 -16.953  < 2e-16 ***
## LanguagesAM                       1.662e+01  1.301e+03   0.013 0.989808    
## LanguagesAR                       1.012e+00  7.410e-01   1.366 0.171826    
## LanguagesAS                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesAY                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesAZ                       1.613e+01  1.014e+03   0.016 0.987307    
## LanguagesBE                       1.581e+01  8.487e+02   0.019 0.985138    
## LanguagesBG                       2.284e+00  9.075e-01   2.517 0.011849 *  
## LanguagesBN                       1.477e+01  3.142e+02   0.047 0.962505    
## LanguagesBO                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesBR                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesBS                       1.508e+01  8.691e+02   0.017 0.986159    
## LanguagesCA                       6.967e-01  7.441e-01   0.936 0.349100    
## LanguagesCS                       6.976e-01  7.318e-01   0.953 0.340461    
## LanguagesCY                       1.527e+01  1.439e+03   0.011 0.991538    
## LanguagesDA                       4.432e-01  7.302e-01   0.607 0.543865    
## LanguagesDE                      -1.192e-01  7.243e-01  -0.165 0.869280    
## LanguagesDZ                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesEL                       7.500e-01  7.343e-01   1.021 0.307043    
## LanguagesEN                       2.818e-01  7.235e-01   0.390 0.696878    
## LanguagesEO                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesES                      -6.180e-02  7.245e-01  -0.085 0.932025    
## LanguagesET                       1.528e+01  6.070e+02   0.025 0.979918    
## LanguagesEU                       1.613e+01  1.014e+03   0.016 0.987307    
## LanguagesFA                       9.259e-01  8.019e-01   1.155 0.248219    
## LanguagesFI                       4.405e-01  7.316e-01   0.602 0.547067    
## LanguagesFO                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesFR                      -9.412e-02  7.244e-01  -0.130 0.896622    
## LanguagesGA                       1.386e+01  1.057e+03   0.013 0.989545    
## LanguagesGD                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesGL                       1.613e+01  1.014e+03   0.016 0.987307    
## LanguagesGN                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesGU                       1.388e+01  3.622e+02   0.038 0.969424    
## LanguagesGV                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesHE                       6.088e-01  7.366e-01   0.826 0.408521    
## LanguagesHI                       9.262e-01  7.724e-01   1.199 0.230470    
## LanguagesHR                       9.225e-01  7.940e-01   1.162 0.245305    
## LanguagesHU                       4.824e-01  7.367e-01   0.655 0.512623    
## LanguagesHY                       1.631e+01  4.869e+02   0.033 0.973279    
## LanguagesID                       1.001e+00  7.348e-01   1.363 0.172907    
## LanguagesIS                       1.568e+01  8.083e+02   0.019 0.984522    
## LanguagesIT                       3.862e-02  7.249e-01   0.053 0.957513    
## LanguagesIU                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesJA                       9.991e-02  7.247e-01   0.138 0.890351    
## LanguagesJV                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesKA                       1.613e+01  1.014e+03   0.016 0.987307    
## LanguagesKK                       1.382e+01  1.257e+03   0.011 0.991226    
## LanguagesKL                       1.564e+01  1.335e+03   0.012 0.990654    
## LanguagesKM                       1.605e+01  9.823e+02   0.016 0.986960    
## LanguagesKN                       1.448e+01  3.323e+02   0.044 0.965252    
## LanguagesKO                      -1.151e-01  7.250e-01  -0.159 0.873877    
## LanguagesKR                       1.570e+01  1.353e+03   0.012 0.990738    
## LanguagesKS                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesKU                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesKY                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesLA                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesLO                       1.605e+01  9.823e+02   0.016 0.986960    
## LanguagesLT                       8.080e-01  9.738e-01   0.830 0.406653    
## LanguagesLV                       1.466e+01  3.290e+02   0.045 0.964465    
## LanguagesMG                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesMK                       1.561e+01  7.233e+02   0.022 0.982786    
## LanguagesML                       1.445e+01  3.592e+02   0.040 0.967918    
## LanguagesMN                       1.662e+01  1.301e+03   0.013 0.989808    
## LanguagesMR                       1.452e+01  3.414e+02   0.043 0.966074    
## LanguagesMS                       9.726e-01  7.407e-01   1.313 0.189153    
## LanguagesMT                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesMY                       1.568e+01  8.129e+02   0.019 0.984615    
## LanguagesNB                       3.984e-01  7.312e-01   0.545 0.585882    
## LanguagesNE                       1.613e+01  1.014e+03   0.016 0.987307    
## LanguagesNL                       2.372e-01  7.266e-01   0.327 0.744044    
## LanguagesNN                       9.433e-01  9.251e-01   1.020 0.307865    
## LanguagesNO                       8.953e-01  7.861e-01   1.139 0.254731    
## LanguagesOM                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesOR                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesPA                       1.356e+01  3.849e+02   0.035 0.971902    
## LanguagesPL                       1.898e-01  7.269e-01   0.261 0.794048    
## LanguagesPS                       1.382e+01  1.257e+03   0.011 0.991226    
## LanguagesPT                       7.204e-02  7.250e-01   0.099 0.920855    
## LanguagesQU                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesRN                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesRO                       1.367e+00  7.514e-01   1.819 0.068855 .  
## LanguagesRU                      -1.052e-02  7.247e-01  -0.015 0.988416    
## LanguagesRW                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesSA                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesSD                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesSE                       5.722e-01  1.014e+00   0.564 0.572510    
## LanguagesSI                       1.593e+01  8.979e+02   0.018 0.985849    
## LanguagesSK                       4.184e-01  7.378e-01   0.567 0.570662    
## LanguagesSL                       1.175e+00  9.102e-01   1.291 0.196783    
## LanguagesSO                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesSQ                       1.565e+01  6.801e+02   0.023 0.981635    
## LanguagesSR                       1.162e+00  9.104e-01   1.276 0.201985    
## LanguagesSU                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesSV                       2.146e-01  7.272e-01   0.295 0.767913    
## LanguagesSW                       1.613e+01  1.014e+03   0.016 0.987307    
## LanguagesTA                       1.448e+01  3.323e+02   0.044 0.965252    
## LanguagesTE                       1.440e+01  3.489e+02   0.041 0.967079    
## LanguagesTG                       1.382e+01  1.257e+03   0.011 0.991226    
## LanguagesTH                       6.569e-01  7.318e-01   0.898 0.369392    
## LanguagesTI                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesTK                       1.382e+01  1.257e+03   0.011 0.991226    
## LanguagesTL                       1.561e+01  6.843e+02   0.023 0.981798    
## LanguagesTO                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesTR                       4.084e-01  7.272e-01   0.562 0.574343    
## LanguagesTT                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesUG                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesUK                       1.043e+00  7.473e-01   1.396 0.162620    
## LanguagesUR                       1.507e+01  6.271e+02   0.024 0.980828    
## LanguagesUZ                       1.382e+01  1.257e+03   0.011 0.991226    
## LanguagesVI                       8.172e-01  7.350e-01   1.112 0.266235    
## LanguagesYI                       1.448e+01  1.976e+03   0.007 0.994155    
## LanguagesZH                       1.580e-01  7.245e-01   0.218 0.827350    
## LanguagesZU                       1.769e+01  2.283e+03   0.008 0.993817    
## Size                             -1.692e-09  4.364e-11 -38.783  < 2e-16 ***
## `Primary Genre`Business          -1.987e+01  5.658e+02  -0.035 0.971988    
## `Primary Genre`Education         -1.698e+01  5.658e+02  -0.030 0.976063    
## `Primary Genre`Entertainment     -1.390e+01  5.658e+02  -0.025 0.980396    
## `Primary Genre`Finance           -1.608e+01  5.658e+02  -0.028 0.977326    
## `Primary Genre`Food & Drink       3.372e-01  1.832e+03   0.000 0.999853    
## `Primary Genre`Games             -1.562e+01  5.658e+02  -0.028 0.977979    
## `Primary Genre`Health & Fitness  -1.963e+01  5.658e+02  -0.035 0.972330    
## `Primary Genre`Lifestyle         -1.891e+01  5.658e+02  -0.033 0.973342    
## `Primary Genre`Medical           -1.437e+01  9.196e+02  -0.016 0.987535    
## `Primary Genre`Music             -4.403e-01  6.574e+02  -0.001 0.999466    
## `Primary Genre`Navigation        -8.158e-01  1.914e+03   0.000 0.999660    
## `Primary Genre`News               4.942e-01  1.496e+03   0.000 0.999736    
## `Primary Genre`Productivity      -2.083e+01  5.658e+02  -0.037 0.970633    
## `Primary Genre`Reference         -1.571e+01  5.658e+02  -0.028 0.977844    
## `Primary Genre`Shopping           5.897e-01  2.352e+03   0.000 0.999800    
## `Primary Genre`Social Networking  9.429e-01  7.467e+02   0.001 0.998992    
## `Primary Genre`Sports            -1.357e+01  5.658e+02  -0.024 0.980870    
## `Primary Genre`Stickers          -1.671e+01  5.658e+02  -0.030 0.976440    
## `Primary Genre`Travel             9.043e-01  2.047e+03   0.000 0.999648    
## `Primary Genre`Utilities         -1.694e+01  5.658e+02  -0.030 0.976113    
## GenresAdventure                  -9.832e-02  1.071e-01  -0.918 0.358385    
## GenresBoard                      -4.650e-01  7.714e-02  -6.029 1.65e-09 ***
## GenresBooks                      -1.924e-01  6.811e-01  -0.282 0.777582    
## GenresBusiness                   -1.517e+00  4.286e-01  -3.538 0.000403 ***
## GenresCard                        8.825e-02  1.139e-01   0.775 0.438540    
## GenresCasino                      1.571e+00  7.276e-01   2.159 0.030850 *  
## GenresCasual                      6.757e-01  1.280e-01   5.280 1.29e-07 ***
## GenresDrink                       7.567e-01  7.442e-01   1.017 0.309268    
## GenresEducation                  -2.767e-01  1.167e-01  -2.371 0.017746 *  
## GenresEmoji                      -7.134e-01  1.593e+00  -0.448 0.654271    
## GenresEntertainment              -7.729e-02  6.641e-02  -1.164 0.244498    
## GenresExpressions                -7.134e-01  1.593e+00  -0.448 0.654271    
## GenresFamily                      2.734e-01  1.396e-01   1.959 0.050089 .  
## GenresFinance                     1.299e+00  7.840e-01   1.657 0.097569 .  
## GenresFitness                     1.106e+00  1.192e+00   0.928 0.353464    
## GenresFood                        7.567e-01  7.442e-01   1.017 0.309268    
## GenresGames                      -1.914e-01  6.252e-02  -3.061 0.002203 ** 
## GenresGaming                     -1.962e-01  1.430e+00  -0.137 0.890849    
## GenresHealth                      1.106e+00  1.192e+00   0.928 0.353464    
## GenresKids                        1.626e+01  3.956e+03   0.004 0.996720    
## GenresLifestyle                   1.926e+00  2.605e-01   7.393 1.43e-13 ***
## GenresMagazines                   1.400e+01  3.956e+03   0.004 0.997177    
## GenresMedical                    -1.678e+00  1.229e+00  -1.366 0.171968    
## GenresMusic                       1.810e+00  5.175e-01   3.498 0.000469 ***
## GenresNavigation                  1.551e+01  1.416e+03   0.011 0.991260    
## GenresNetworking                  3.440e-01  3.447e-01   0.998 0.318405    
## GenresNews                        5.213e-01  1.174e+00   0.444 0.657054    
## GenresNewspapers                  1.400e+01  3.956e+03   0.004 0.997177    
## GenresPhoto                       1.543e+01  1.196e+03   0.013 0.989709    
## GenresPlaying                    -4.054e-01  8.811e-02  -4.601 4.21e-06 ***
## GenresProductivity                1.020e-01  4.502e-01   0.226 0.820845    
## GenresPuzzle                     -1.003e-02  7.648e-02  -0.131 0.895698    
## GenresRacing                     -3.996e-01  3.989e-01  -1.002 0.316393    
## GenresReference                  -9.283e-01  3.215e-01  -2.887 0.003884 ** 
## GenresRole                       -4.054e-01  8.811e-02  -4.601 4.21e-06 ***
## GenresShopping                   -1.332e-01  4.567e+03   0.000 0.999977    
## GenresSimulation                 -8.519e-01  7.628e-02 -11.168  < 2e-16 ***
## GenresSocial                      3.440e-01  3.447e-01   0.998 0.318405    
## GenresSports                      4.094e-01  1.743e-01   2.349 0.018841 *  
## GenresStickers                   -1.962e-01  1.430e+00  -0.137 0.890849    
## GenresStrategy                   -1.915e-01  6.252e-02  -3.063 0.002195 ** 
## GenresTravel                     -5.789e-01  5.062e-01  -1.143 0.252838    
## GenresTrivia                      1.612e+00  3.408e-01   4.731 2.24e-06 ***
## GenresUtilities                   1.243e-01  2.268e-01   0.548 0.583510    
## GenresVideo                       1.543e+01  1.196e+03   0.013 0.989709    
## GenresWeather                    -1.984e+01  3.956e+03  -0.005 0.995998    
## GenresWord                        6.339e-01  5.128e-01   1.236 0.216410    
## sum.iap                           2.593e-02  8.049e-04  32.217  < 2e-16 ***
## count.iap                        -8.947e-02  6.639e-03 -13.475  < 2e-16 ***
## iap.class$0.01-$10.00             1.591e+00  3.153e-02  50.454  < 2e-16 ***
## iap.class$10.01-$20.00            1.170e+00  7.322e-02  15.975  < 2e-16 ***
## iap.class$20.01-$30.00           -3.538e-01  1.419e-01  -2.494 0.012619 *  
## iap.class$30.01-$40.00           -7.892e-01  1.396e-01  -5.654 1.57e-08 ***
## iap.class$40.01-$80.00            9.810e+00  1.182e+02   0.083 0.933835    
## days.since.release               -8.451e-04  1.335e-05 -63.289  < 2e-16 ***
## days.since.last.update            7.579e-04  1.684e-05  45.015  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 81752  on 101228  degrees of freedom
## Residual deviance: 57415  on 101034  degrees of freedom
## AIC: 57805
## 
## Number of Fisher Scoring iterations: 16
## [1] 0.1150164

Unfortunately, there are now even fewer significant language variables, with only 2 of the 113 significant, and fewer significant sub genres, with 17 of the 48 significant. The iap class of $40.01 to $80.00 is insignificant as well. On the other hand, this model sets a new best: the mce decreased to 0.115.

Since the clean full width model had the smaller mce, we filtered the data to keep only rows where every variable was significant and refit the model. However, this filtering left only 592 expanded rows out of the original roughly 99,000. And because the rows are expanded (one per language and sub genre combination), they likely represent only about 200 distinct apps, so the glm fit failed to converge: there were too few observations for the number of predictors.

Keeping Only Significant Variables and Redoing the Regression

## # A tibble: 569 x 12
##    `Average User Rati~ `User Rating Coun~ `Age Rating` Languages    Size Genres 
##                  <dbl>              <dbl> <chr>        <chr>       <dbl> <chr>  
##  1                 4.5             143719 4+           BG         1.10e8 Games  
##  2                 4.5             143719 4+           BG         1.10e8 Strate~
##  3                 4.5             143719 4+           BG         1.10e8 Board  
##  4                 4.5             143719 4+           RO         1.10e8 Games  
##  5                 4.5             143719 4+           RO         1.10e8 Strate~
##  6                 4.5             143719 4+           RO         1.10e8 Board  
##  7                 3                 3909 17+          RO         1.31e8 Games  
##  8                 3                 3909 17+          RO         1.31e8 Simula~
##  9                 3                 3909 17+          RO         1.31e8 Strate~
## 10                 3.5                244 9+           RO         5.12e7 Games  
## # ... with 559 more rows, and 6 more variables: sum.iap <dbl>, count.iap <int>,
## #   iap.class <chr>, free <dbl>, days.since.release <dbl>,
## #   days.since.last.update <dbl>
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

So instead of filtering by “and” we filtered by “or,” meaning that as long as a row contained at least one significant variable, we kept it for the model. As the output shows, this left significantly more data to work with, roughly 93,000 rows. For this sig.or model we also removed the Primary Genre variable entirely, since none of its levels had been significant, with p-values around 0.95.
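The “or” filter can be sketched as follows, assuming dplyr is available. The vectors `sig.langs` and `sig.genres` are hypothetical stand-ins for the significant levels identified in the previous model, and the toy tibble stands in for the cleaned data:

```r
library(dplyr)

# Toy cleaned data: one row per language/genre combination of an app.
logit.data.clean <- tibble::tibble(
  Languages       = c("BG", "DA", "EN"),
  Genres          = c("Board", "Puzzle", "Casual"),
  `Primary Genre` = c("Games", "Games", "Games")
)

# Hypothetical sets of significant levels from the previous model.
sig.langs  <- c("BG", "RO")
sig.genres <- c("Board", "Casual", "Simulation")

logit.data.sig.or <- logit.data.clean %>%
  select(-`Primary Genre`) %>%   # dropped: all levels had p-values near 0.95
  # Keep a row if ANY of its levels was significant ("or" rather than "and").
  filter(Languages %in% sig.langs | Genres %in% sig.genres)
```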

Removing Rows That ONLY Have Insignificant Values and Redoing the Regression

## # A tibble: 93,152 x 12
##    `Average User Rati~ `User Rating Coun~ `Age Rating`    Size Genres  Languages
##                  <dbl>              <dbl> <chr>          <dbl> <chr>   <chr>    
##  1                   4               3553 4+            1.59e7 Games   DA       
##  2                   4               3553 4+            1.59e7 Strate~ DA       
##  3                   4               3553 4+            1.59e7 Games   NL       
##  4                   4               3553 4+            1.59e7 Strate~ NL       
##  5                   4               3553 4+            1.59e7 Games   EN       
##  6                   4               3553 4+            1.59e7 Strate~ EN       
##  7                   4               3553 4+            1.59e7 Games   FI       
##  8                   4               3553 4+            1.59e7 Strate~ FI       
##  9                   4               3553 4+            1.59e7 Games   FR       
## 10                   4               3553 4+            1.59e7 Strate~ FR       
## # ... with 93,142 more rows, and 6 more variables: sum.iap <dbl>,
## #   count.iap <int>, iap.class <chr>, free <dbl>, days.since.release <dbl>,
## #   days.since.last.update <dbl>
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = free ~ ., family = binomial, data = logit.data.sig.or)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5474   0.1017   0.2665   0.4744   2.4329  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             1.408e+00  7.360e-01   1.913  0.05579 .  
## `Average User Rating`  -1.085e-01  1.672e-02  -6.492 8.49e-11 ***
## `User Rating Count`     1.260e-05  8.668e-07  14.532  < 2e-16 ***
## `Age Rating`17+         9.032e-01  8.511e-02  10.612  < 2e-16 ***
## `Age Rating`4+          5.028e-01  3.495e-02  14.388  < 2e-16 ***
## `Age Rating`9+         -6.209e-01  3.749e-02 -16.563  < 2e-16 ***
## Size                   -1.579e-09  4.369e-11 -36.141  < 2e-16 ***
## GenresAdventure         1.959e-01  1.649e-01   1.188  0.23483    
## GenresBoard            -1.674e-01  8.800e-02  -1.902  0.05715 .  
## GenresBooks             1.537e+01  8.340e+02   0.018  0.98530    
## GenresBusiness         -2.040e+00  3.435e-01  -5.937 2.90e-09 ***
## GenresCard              7.597e-01  1.924e-01   3.949 7.85e-05 ***
## GenresCasino            1.851e+00  7.286e-01   2.541  0.01107 *  
## GenresCasual            1.017e+00  1.339e-01   7.595 3.07e-14 ***
## GenresDrink             1.441e+01  4.718e+02   0.031  0.97564    
## GenresEducation        -2.575e-01  1.200e-01  -2.146  0.03190 *  
## GenresEntertainment     3.475e-01  8.519e-02   4.079 4.53e-05 ***
## GenresFamily            6.168e-01  1.427e-01   4.322 1.55e-05 ***
## GenresFinance           1.514e+00  7.818e-01   1.937  0.05275 .  
## GenresFitness          -1.500e-01  1.056e+00  -0.142  0.88696    
## GenresFood              1.441e+01  4.718e+02   0.031  0.97564    
## GenresGames             8.741e-02  7.560e-02   1.156  0.24760    
## GenresHealth           -1.500e-01  1.056e+00  -0.142  0.88696    
## GenresLifestyle         1.836e+00  2.531e-01   7.255 4.01e-13 ***
## GenresMagazines         1.436e+01  3.956e+03   0.004  0.99710    
## GenresMedical          -1.833e+01  2.795e+03  -0.007  0.99477    
## GenresMusic             2.304e+00  5.154e-01   4.470 7.82e-06 ***
## GenresNavigation        1.471e+01  2.742e+03   0.005  0.99572    
## GenresNetworking        3.457e-02  3.416e-01   0.101  0.91938    
## GenresNews              1.544e+01  1.876e+03   0.008  0.99344    
## GenresNewspapers        1.436e+01  3.956e+03   0.004  0.99710    
## GenresPhoto             1.487e+01  1.574e+03   0.009  0.99246    
## GenresPlaying          -2.117e-01  9.621e-02  -2.200  0.02779 *  
## GenresProductivity     -4.512e+00  4.416e-01 -10.218  < 2e-16 ***
## GenresPuzzle            4.431e-01  1.210e-01   3.663  0.00025 ***
## GenresRacing           -2.074e+00  3.380e-01  -6.137 8.43e-10 ***
## GenresReference        -6.434e-01  3.001e-01  -2.144  0.03205 *  
## GenresRole             -2.117e-01  9.621e-02  -2.200  0.02779 *  
## GenresSimulation       -5.511e-01  8.653e-02  -6.369 1.90e-10 ***
## GenresSocial            3.457e-02  3.416e-01   0.101  0.91938    
## GenresSports            9.937e-01  1.769e-01   5.618 1.93e-08 ***
## GenresStrategy          8.734e-02  7.560e-02   1.155  0.24795    
## GenresTravel            1.173e+00  1.040e+00   1.128  0.25942    
## GenresTrivia            1.864e+00  3.361e-01   5.545 2.94e-08 ***
## GenresUtilities         1.076e+00  5.393e-01   1.995  0.04607 *  
## GenresVideo             1.487e+01  1.574e+03   0.009  0.99246    
## GenresWord              1.330e+00  1.029e+00   1.293  0.19611    
## LanguagesAM             1.668e+01  1.394e+03   0.012  0.99046    
## LanguagesAR             1.347e+00  7.478e-01   1.801  0.07165 .  
## LanguagesAS             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesAY             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesAZ             1.613e+01  1.052e+03   0.015  0.98777    
## LanguagesBE             1.581e+01  8.774e+02   0.018  0.98562    
## LanguagesBG             2.461e+00  9.100e-01   2.704  0.00684 ** 
## LanguagesBN             1.497e+01  3.118e+02   0.048  0.96172    
## LanguagesBO             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesBR             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesBS             1.528e+01  8.868e+02   0.017  0.98626    
## LanguagesCA             9.991e-01  7.504e-01   1.331  0.18308    
## LanguagesCS             8.977e-01  7.369e-01   1.218  0.22311    
## LanguagesCY             1.559e+01  1.434e+03   0.011  0.99133    
## LanguagesDA             5.852e-01  7.351e-01   0.796  0.42599    
## LanguagesDE             1.129e-01  7.285e-01   0.155  0.87684    
## LanguagesDZ             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesEL             9.417e-01  7.395e-01   1.273  0.20285    
## LanguagesEN             4.961e-01  7.276e-01   0.682  0.49533    
## LanguagesEO             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesES             1.934e-01  7.287e-01   0.265  0.79074    
## LanguagesET             1.537e+01  6.049e+02   0.025  0.97973    
## LanguagesEU             1.613e+01  1.052e+03   0.015  0.98777    
## LanguagesFA             1.219e+00  8.201e-01   1.486  0.13729    
## LanguagesFI             6.579e-01  7.369e-01   0.893  0.37198    
## LanguagesFO             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesFR             1.370e-01  7.286e-01   0.188  0.85084    
## LanguagesGA             1.492e+01  1.002e+03   0.015  0.98812    
## LanguagesGD             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesGL             1.613e+01  1.052e+03   0.015  0.98777    
## LanguagesGN             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesGU             1.426e+01  3.534e+02   0.040  0.96780    
## LanguagesGV             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesHE             8.331e-01  7.417e-01   1.123  0.26133    
## LanguagesHI             1.282e+00  7.884e-01   1.626  0.10398    
## LanguagesHR             1.427e+00  8.300e-01   1.719  0.08563 .  
## LanguagesHU             6.244e-01  7.414e-01   0.842  0.39964    
## LanguagesHY             1.636e+01  5.157e+02   0.032  0.97469    
## LanguagesID             1.273e+00  7.408e-01   1.719  0.08561 .  
## LanguagesIS             1.567e+01  8.334e+02   0.019  0.98500    
## LanguagesIT             2.666e-01  7.291e-01   0.366  0.71468    
## LanguagesIU             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesJA             3.222e-01  7.289e-01   0.442  0.65851    
## LanguagesJV             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesKA             1.613e+01  1.052e+03   0.015  0.98777    
## LanguagesKK             1.409e+01  1.261e+03   0.011  0.99109    
## LanguagesKL             1.595e+01  1.333e+03   0.012  0.99045    
## LanguagesKM             1.607e+01  1.028e+03   0.016  0.98753    
## LanguagesKN             1.465e+01  3.306e+02   0.044  0.96466    
## LanguagesKO             1.209e-01  7.292e-01   0.166  0.86837    
## LanguagesKR             1.581e+01  1.320e+03   0.012  0.99044    
## LanguagesKS             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesKU             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesKY             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesLA             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesLO             1.607e+01  1.028e+03   0.016  0.98753    
## LanguagesLT             7.773e-01  9.808e-01   0.792  0.42809    
## LanguagesLV             1.479e+01  3.298e+02   0.045  0.96423    
## LanguagesMG             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesMK             1.564e+01  7.571e+02   0.021  0.98352    
## LanguagesML             1.459e+01  3.566e+02   0.041  0.96736    
## LanguagesMN             1.668e+01  1.394e+03   0.012  0.99046    
## LanguagesMR             1.469e+01  3.386e+02   0.043  0.96540    
## LanguagesMS             1.231e+00  7.466e-01   1.649  0.09923 .  
## LanguagesMT             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesMY             1.566e+01  8.395e+02   0.019  0.98512    
## LanguagesNB             6.132e-01  7.365e-01   0.833  0.40503    
## LanguagesNE             1.613e+01  1.052e+03   0.015  0.98777    
## LanguagesNL             4.519e-01  7.311e-01   0.618  0.53650    
## LanguagesNN             1.607e+00  1.064e+00   1.510  0.13117    
## LanguagesNO             8.033e-01  7.963e-01   1.009  0.31302    
## LanguagesOM             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesOR             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesPA             1.383e+01  3.865e+02   0.036  0.97146    
## LanguagesPL             4.268e-01  7.314e-01   0.584  0.55950    
## LanguagesPS             1.409e+01  1.261e+03   0.011  0.99109    
## LanguagesPT             3.056e-01  7.293e-01   0.419  0.67522    
## LanguagesQU             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesRN             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesRO             1.633e+00  7.550e-01   2.164  0.03050 *  
## LanguagesRU             2.458e-01  7.290e-01   0.337  0.73599    
## LanguagesRW             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesSA             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesSD             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesSE             5.839e-01  1.040e+00   0.561  0.57452    
## LanguagesSI             1.595e+01  9.321e+02   0.017  0.98635    
## LanguagesSK             5.797e-01  7.425e-01   0.781  0.43496    
## LanguagesSL             1.842e+00  1.045e+00   1.762  0.07802 .  
## LanguagesSO             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesSQ             1.574e+01  7.089e+02   0.022  0.98229    
## LanguagesSR             1.835e+00  1.045e+00   1.755  0.07918 .  
## LanguagesSU             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesSV             4.119e-01  7.318e-01   0.563  0.57351    
## LanguagesSW             1.613e+01  1.052e+03   0.015  0.98777    
## LanguagesTA             1.465e+01  3.306e+02   0.044  0.96466    
## LanguagesTE             1.455e+01  3.477e+02   0.042  0.96661    
## LanguagesTG             1.409e+01  1.261e+03   0.011  0.99109    
## LanguagesTH             9.369e-01  7.372e-01   1.271  0.20378    
## LanguagesTI             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesTK             1.409e+01  1.261e+03   0.011  0.99109    
## LanguagesTL             1.571e+01  7.104e+02   0.022  0.98236    
## LanguagesTO             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesTR             6.426e-01  7.317e-01   0.878  0.37985    
## LanguagesTT             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesUG             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesUK             1.386e+00  7.553e-01   1.835  0.06651 .  
## LanguagesUR             1.534e+01  6.052e+02   0.025  0.97978    
## LanguagesUZ             1.409e+01  1.261e+03   0.011  0.99109    
## LanguagesVI             1.104e+00  7.409e-01   1.490  0.13614    
## LanguagesYI             1.476e+01  1.975e+03   0.007  0.99404    
## LanguagesZH             3.738e-01  7.288e-01   0.513  0.60801    
## LanguagesZU             1.799e+01  2.797e+03   0.006  0.99487    
## sum.iap                 2.543e-02  7.938e-04  32.031  < 2e-16 ***
## count.iap              -8.663e-02  6.574e-03 -13.178  < 2e-16 ***
## iap.class$0.01-$10.00   1.509e+00  3.317e-02  45.490  < 2e-16 ***
## iap.class$10.01-$20.00  1.051e+00  7.180e-02  14.639  < 2e-16 ***
## iap.class$20.01-$30.00 -3.587e-01  1.404e-01  -2.556  0.01060 *  
## iap.class$30.01-$40.00 -7.776e-01  1.392e-01  -5.588 2.30e-08 ***
## iap.class$40.01-$80.00  1.007e+01  1.383e+02   0.073  0.94195    
## days.since.release     -8.209e-04  1.428e-05 -57.473  < 2e-16 ***
## days.since.last.update  7.172e-04  1.812e-05  39.585  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 71103  on 93151  degrees of freedom
## Residual deviance: 50834  on 92984  degrees of freedom
## AIC: 51170
## 
## Number of Fisher Scoring iterations: 16
## [1] 0.1077057

Looking at the sig.or model, the significant coefficients among the Languages and sub-genre variables are similar to those of the base full-width model, but the MCE is even lower at 0.1077. The significant variables are size, the count and sum of in-app purchases, days since release, days since last update, user rating count, and age rating.

For logit.err we filtered each column down to the significant variables, but the significant languages were so few that there was not enough data to fit a model. We therefore removed the language column entirely and then filtered by the remaining significant variables.
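The refit described above can be sketched as follows. The data frame and the short list of significant variables here are toy stand-ins; the real model is fit on logit.data.sig.and.nl with the full set of significant predictors shown in the summary below.

```r
set.seed(1)
# Toy stand-in for the logistic-regression data (hypothetical values)
toy <- data.frame(
  free      = rbinom(200, 1, 0.8),
  Size      = runif(200, 1e7, 5e8),
  sum.iap   = runif(200, 0, 50),
  Languages = sample(c("EN", "FR", "DE"), 200, replace = TRUE)
)

# Keep only the significant variables, dropping Languages entirely
sig.vars <- c("free", "Size", "sum.iap")
toy.sig  <- toy[, sig.vars]

# Refit the logistic model on the reduced data
fit <- glm(free ~ ., family = binomial, data = toy.sig)
```

The same `glm(free ~ ., family = binomial, data = ...)` call is reused throughout; only the columns of the data frame change between models.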

Keeping Only the Significant Variables (Minus Language) and Redoing the Regression

## # A tibble: 54,136 x 11
##    `Average User Ratin~ `User Rating Coun~ `Age Rating`     Size Genres  sum.iap
##                   <dbl>              <dbl> <chr>           <dbl> <chr>     <dbl>
##  1                    3                 47 4+             4.87e7 Games      1.99
##  2                    3                 47 4+             4.87e7 Strate~    1.99
##  3                    3                112 4+             1.23e8 Games      0.99
##  4                    3                112 4+             1.23e8 Strate~    0.99
##  5                    3                112 4+             1.23e8 Board      0.99
##  6                    3                112 4+             1.23e8 Games      0.99
##  7                    3                112 4+             1.23e8 Strate~    0.99
##  8                    3                112 4+             1.23e8 Board      0.99
##  9                    3                112 4+             1.23e8 Games      0.99
## 10                    3                112 4+             1.23e8 Strate~    0.99
## # ... with 54,126 more rows, and 5 more variables: count.iap <int>,
## #   iap.class <chr>, free <dbl>, days.since.release <dbl>,
## #   days.since.last.update <dbl>
## 
## Call:
## glm(formula = free ~ ., family = binomial, data = logit.data.sig.and.nl)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5888   0.1200   0.2508   0.3872   1.6717  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             2.685e+00  1.577e-01  17.025  < 2e-16 ***
## `Average User Rating`   1.256e-02  2.955e-02   0.425 0.670897    
## `User Rating Count`     5.673e-06  1.017e-06   5.579 2.41e-08 ***
## `Age Rating`17+         5.631e-01  1.591e-01   3.539 0.000402 ***
## `Age Rating`4+          7.357e-01  5.277e-02  13.942  < 2e-16 ***
## `Age Rating`9+         -6.317e-01  5.161e-02 -12.240  < 2e-16 ***
## Size                   -1.055e-09  5.241e-11 -20.125  < 2e-16 ***
## GenresBusiness         -2.227e+00  3.892e-01  -5.722 1.06e-08 ***
## GenresCasino            1.258e+00  1.047e+00   1.201 0.229566    
## GenresCasual            1.038e+00  1.823e-01   5.694 1.24e-08 ***
## GenresEducation         1.054e+00  2.509e-01   4.201 2.66e-05 ***
## GenresFamily            3.563e+00  5.854e-01   6.086 1.15e-09 ***
## GenresFinance           1.092e+00  1.034e+00   1.056 0.290830    
## GenresGames             3.961e-01  7.901e-02   5.013 5.35e-07 ***
## GenresLifestyle         1.678e+00  6.033e-01   2.782 0.005407 ** 
## GenresMusic             3.366e+00  1.008e+00   3.338 0.000843 ***
## GenresPlaying           3.722e-01  1.127e-01   3.301 0.000963 ***
## GenresReference        -1.780e-01  5.080e-01  -0.350 0.726068    
## GenresRole              3.722e-01  1.127e-01   3.301 0.000963 ***
## GenresSimulation       -1.889e-01  9.676e-02  -1.952 0.050889 .  
## GenresSports            1.477e+00  3.025e-01   4.882 1.05e-06 ***
## GenresStrategy          3.961e-01  7.901e-02   5.013 5.35e-07 ***
## GenresTrivia            1.366e+01  9.998e+01   0.137 0.891308    
## sum.iap                 2.498e-02  8.917e-04  28.013  < 2e-16 ***
## count.iap              -8.592e-02  7.615e-03 -11.282  < 2e-16 ***
## iap.class$10.01-$20.00 -6.603e-01  7.507e-02  -8.796  < 2e-16 ***
## iap.class$20.01-$30.00 -1.885e+00  1.581e-01 -11.925  < 2e-16 ***
## iap.class$30.01-$40.00 -2.178e+00  1.574e-01 -13.836  < 2e-16 ***
## days.since.release     -8.818e-04  2.336e-05 -37.755  < 2e-16 ***
## days.since.last.update  3.794e-04  2.963e-05  12.802  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27563  on 54135  degrees of freedom
## Residual deviance: 22195  on 54106  degrees of freedom
## AIC: 22255
## 
## Number of Fisher Scoring iterations: 14
## [1] 0.0691961

After looking through the Language and sub-genre columns, we found many languages and sub-genres with very few occurrences. A level that appears in only 10-30 apps provides too little data to predict anything reliably. So we counted the occurrences of every language and sub-genre, and relabeled any level that occurred fewer than 30 times as “Other”.

Finding the Counts of Each Language and Genre, then Reclassifying Those Below a Certain Count

## # A tibble: 6 x 3
##   Languages     n lang.new
##   <chr>     <int> <chr>   
## 1 SD            1 Other   
## 2 SO            1 Other   
## 3 SU            1 Other   
## 4 TI            1 Other   
## 5 TO            1 Other   
## 6 TT            1 Other
## # A tibble: 6 x 3
##   Genres          n genre.new
##   <chr>       <int> <chr>    
## 1 Medical         3 Other    
## 2 Stickers        3 Other    
## 3 Emoji           2 Other    
## 4 Expressions     2 Other    
## 5 Kids            1 Other    
## 6 Magazines       1 Other

We then joined the new variables to the previous data using left_join and reran practically the same model. The only difference is that instead of filtering for the significant languages and sub-genres, we relabeled any language or sub-genre that occurred fewer than 30 times as “Other”.
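The counting, relabeling, and left_join steps can be sketched like this on a toy language column (the real code applies the same idea to both Languages and Genres; names and counts here are invented):

```r
library(dplyr)

# Toy stand-in: two common languages and a few one-off languages
langs <- data.frame(
  Languages = c(rep("EN", 40), rep("FR", 35), "SD", "SO", "SU")
)

# Count occurrences and relabel rare levels (< 30) as "Other"
lang.counts <- langs %>%
  count(Languages) %>%
  mutate(lang.new = ifelse(n < 30, "Other", Languages))

# Attach the relabeled column back onto the original data
langs.new <- langs %>%
  left_join(lang.counts %>% select(Languages, lang.new), by = "Languages")
```

After the join, the model is refit on `lang.new` (and `genre.new`) instead of the raw columns, which is what collapses dozens of near-empty factor levels into a single “Other” level.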

## # A tibble: 3 x 12
##   `Average User Rati~ `User Rating Coun~ `Age Rating`    Size genre.new lang.new
##                 <dbl>              <dbl> <chr>          <dbl> <chr>     <chr>   
## 1                 4.5                822 12+           7.78e8 Other     Other   
## 2                 4.5               1026 4+            5.65e7 Other     Other   
## 3                 4.5               1026 4+            5.65e7 Other     Other   
## # ... with 6 more variables: sum.iap <dbl>, count.iap <int>, iap.class <chr>,
## #   free <dbl>, days.since.release <dbl>, days.since.last.update <dbl>
## 
## Call:
## glm(formula = free ~ ., family = binomial, data = logit.data.count)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5746   0.1168   0.2447   0.3805   1.6717  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             4.244e+00  2.646e-01  16.043  < 2e-16 ***
## `Average User Rating`  -1.897e-02  2.648e-02  -0.716 0.473711    
## `User Rating Count`     6.112e-06  8.679e-07   7.043 1.89e-12 ***
## `Age Rating`17+         3.386e-01  1.352e-01   2.505 0.012238 *  
## `Age Rating`4+          5.970e-01  4.753e-02  12.560  < 2e-16 ***
## `Age Rating`9+         -7.058e-01  4.647e-02 -15.187  < 2e-16 ***
## Size                   -1.045e-09  4.618e-11 -22.619  < 2e-16 ***
## genre.newAdventure      1.693e-01  1.650e-01   1.026 0.304943    
## genre.newBoard         -4.014e-01  1.036e-01  -3.876 0.000106 ***
## genre.newCard           5.421e-01  1.899e-01   2.855 0.004308 ** 
## genre.newCasual         6.387e-01  1.819e-01   3.511 0.000447 ***
## genre.newEducation      6.439e-01  2.531e-01   2.544 0.010972 *  
## genre.newEntertainment  3.170e-01  8.492e-02   3.733 0.000189 ***
## genre.newFamily         3.121e+00  5.854e-01   5.332 9.70e-08 ***
## genre.newGames         -4.023e-03  7.731e-02  -0.052 0.958497    
## genre.newLifestyle      1.301e+00  6.041e-01   2.153 0.031321 *  
## genre.newMusic          2.916e+00  1.009e+00   2.889 0.003863 ** 
## genre.newNetworking     1.050e-01  3.407e-01   0.308 0.757919    
## genre.newOther         -1.108e+00  1.777e-01  -6.235 4.50e-10 ***
## genre.newPlaying       -6.077e-02  1.095e-01  -0.555 0.578967    
## genre.newPuzzle         4.082e-01  1.218e-01   3.350 0.000807 ***
## genre.newRacing        -2.067e+00  3.424e-01  -6.036 1.58e-09 ***
## genre.newReference     -5.711e-01  5.078e-01  -1.125 0.260711    
## genre.newRole          -6.077e-02  1.095e-01  -0.555 0.578967    
## genre.newSimulation    -5.680e-01  9.505e-02  -5.976 2.28e-09 ***
## genre.newSocial         1.050e-01  3.407e-01   0.308 0.757919    
## genre.newSports         1.106e+00  3.031e-01   3.648 0.000264 ***
## genre.newStrategy      -4.023e-03  7.731e-02  -0.052 0.958497    
## genre.newTravel         1.172e+00  1.046e+00   1.120 0.262836    
## genre.newTrivia         1.430e+01  1.635e+02   0.087 0.930308    
## genre.newUtilities      1.124e+00  5.421e-01   2.073 0.038182 *  
## genre.newWord           1.309e+00  1.033e+00   1.267 0.205024    
## lang.newBG              1.184e+01  1.905e+02   0.062 0.950447    
## lang.newBN              1.224e+01  2.041e+02   0.060 0.952174    
## lang.newCA             -1.867e-01  3.549e-01  -0.526 0.598796    
## lang.newCS             -2.865e-01  2.940e-01  -0.975 0.329776    
## lang.newDA             -1.074e+00  2.575e-01  -4.171 3.04e-05 ***
## lang.newDE             -1.143e+00  2.282e-01  -5.008 5.51e-07 ***
## lang.newEL             -6.216e-01  2.805e-01  -2.216 0.026674 *  
## lang.newEN             -9.176e-01  2.227e-01  -4.120 3.78e-05 ***
## lang.newES             -1.060e+00  2.293e-01  -4.621 3.81e-06 ***
## lang.newFA             -9.519e-01  5.599e-01  -1.700 0.089102 .  
## lang.newFI             -9.902e-01  2.672e-01  -3.706 0.000210 ***
## lang.newFR             -1.095e+00  2.289e-01  -4.784 1.72e-06 ***
## lang.newHE             -8.460e-01  2.817e-01  -3.003 0.002675 ** 
## lang.newHI             -1.777e-01  4.747e-01  -0.374 0.708073    
## lang.newHR              2.396e-01  6.272e-01   0.382 0.702457    
## lang.newHU             -1.053e+00  2.776e-01  -3.792 0.000149 ***
## lang.newID             -8.781e-02  2.975e-01  -0.295 0.767911    
## lang.newIT             -1.010e+00  2.320e-01  -4.355 1.33e-05 ***
## lang.newJA             -1.044e+00  2.299e-01  -4.541 5.60e-06 ***
## lang.newKO             -1.208e+00  2.311e-01  -5.229 1.70e-07 ***
## lang.newMS             -1.465e-01  3.211e-01  -0.456 0.648313    
## lang.newNB             -8.904e-01  2.686e-01  -3.314 0.000919 ***
## lang.newNL             -7.945e-01  2.461e-01  -3.228 0.001247 ** 
## lang.newNO             -1.544e+00  4.679e-01  -3.299 0.000969 ***
## lang.newOther           9.856e-01  4.673e-01   2.109 0.034943 *  
## lang.newPL             -9.147e-01  2.460e-01  -3.718 0.000200 ***
## lang.newPT             -9.440e-01  2.330e-01  -4.051 5.10e-05 ***
## lang.newRO              4.875e-01  4.420e-01   1.103 0.270028    
## lang.newRU             -9.776e-01  2.307e-01  -4.238 2.26e-05 ***
## lang.newSK             -1.089e+00  2.780e-01  -3.918 8.93e-05 ***
## lang.newSL              1.160e+01  2.065e+02   0.056 0.955221    
## lang.newSV             -8.749e-01  2.517e-01  -3.476 0.000509 ***
## lang.newTH             -3.596e-01  2.806e-01  -1.282 0.199967    
## lang.newTR             -5.553e-01  2.506e-01  -2.216 0.026718 *  
## lang.newUK              6.600e-01  4.434e-01   1.488 0.136637    
## lang.newVI             -1.525e-01  3.054e-01  -0.499 0.617549    
## lang.newZH             -9.876e-01  2.296e-01  -4.301 1.70e-05 ***
## sum.iap                 2.305e-02  7.906e-04  29.153  < 2e-16 ***
## count.iap              -8.643e-02  6.660e-03 -12.977  < 2e-16 ***
## iap.class$10.01-$20.00 -4.380e-01  6.848e-02  -6.397 1.59e-10 ***
## iap.class$20.01-$30.00 -1.612e+00  1.405e-01 -11.475  < 2e-16 ***
## iap.class$30.01-$40.00 -1.982e+00  1.422e-01 -13.935  < 2e-16 ***
## days.since.release     -9.167e-04  2.060e-05 -44.491  < 2e-16 ***
## days.since.last.update  4.815e-04  2.601e-05  18.511  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 35855  on 71459  degrees of freedom
## Residual deviance: 28725  on 71384  degrees of freedom
## AIC: 28877
## 
## Number of Fisher Scoring iterations: 15
## [1] 0.06766023

Looking at the coefficients for the count model, the ratio of significant to insignificant variables among languages and sub-genres is higher than in the previous versions, and the MCE decreased yet again, this time to 0.0677. One small issue is that the average user rating variable was insignificant again, but when we tried to remove it the MCE jumped, so we kept it in. So this should be our best model, right?

To confirm our results, we created a function that computes the ROC curve and MCE for each model so that we could compare them.
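A minimal sketch of such a helper (names are assumptions, not the report’s actual function): given observed 0/1 labels and predicted probabilities, it returns the MCE at a 0.5 cutoff and TPR/FPR pairs across a grid of thresholds for plotting an ROC curve.

```r
eval_model <- function(observed, predicted, cutoff = 0.5) {
  # Misclassification error at a single cutoff
  mce <- mean((predicted > cutoff) != observed)

  # TPR/FPR pairs over a grid of thresholds, for the ROC curve
  thresholds <- seq(0, 1, by = 0.01)
  roc <- t(sapply(thresholds, function(th) {
    pred <- predicted > th
    c(fpr = sum(pred & observed == 0) / sum(observed == 0),
      tpr = sum(pred & observed == 1) / sum(observed == 1))
  }))

  list(mce = mce, roc = as.data.frame(roc))
}

# Tiny worked example
res <- eval_model(observed  = c(0, 0, 1, 1),
                  predicted = c(0.2, 0.6, 0.7, 0.9))
res$mce  # 0.25: one of the four cases is misclassified at the 0.5 cutoff
```

Running this once per model and binding the results is what allows the MCE bar comparison and the overlaid ROC curves below.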

Comparison of Evaluation Metrics for All Models

## # A tibble: 367,070 x 4
##    observed predicted    mce class
##       <dbl>     <dbl>  <dbl> <chr>
##  1        1     0.533 0.0677 count
##  2        1     0.611 0.0677 count
##  3        1     0.533 0.0677 count
##  4        1     0.633 0.0677 count
##  5        1     0.685 0.0677 count
##  6        1     0.685 0.0677 count
##  7        1     0.593 0.0677 count
##  8        1     0.645 0.0677 count
##  9        1     0.645 0.0677 count
## 10        1     0.550 0.0677 count
## # ... with 367,060 more rows
## # A tibble: 4 x 2
##   class       mce
##   <chr>     <dbl>
## 1 base.fw  0.137 
## 2 clean.fw 0.115 
## 3 count    0.0677
## 4 sig.or   0.108


Looking first at the MCE plot, we can see the four models and their respective MCEs, with count being the lowest and the base full-width model the greatest. However, when we look at the ROC curve, we see that the sig.or and clean full-width models have curves closer to the top-left corner and thus a greater AUC.

Misclassification error is calculated at a single threshold, so even though count has the smallest MCE, the ROC curve captures both type I and type II errors and shows classification results across all thresholds; by that measure, sig.or and clean.fw are the better models.

Since clean.fw and sig.or were practically the same, we used sig.or as the best model to calculate some classification evaluation metrics.

Classification Evaluation Metrics

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor Genres has new levels Magazines, Newspapers
## [1] 0.1077057
##    logit.count.class
##         0     1
##   0  3463  8416
##   1  1617 79656
## [1] 0.9801041
## [1] 0.2915229
## # A tibble: 93,152 x 2
##    observed predicted
##       <dbl>     <dbl>
##  1        0     0.340
##  2        0     0.340
##  3        0     0.311
##  4        0     0.311
##  5        0     0.320
##  6        0     0.320
##  7        0     0.356
##  8        0     0.356
##  9        0     0.247
## 10        0     0.247
## # ... with 93,142 more rows

We were not able to compute the 10-fold cross-validation MCE because cv.glm kept returning an error: some factor levels (such as the Genres levels Magazines and Newspapers) occur so rarely that a validation fold can contain levels the training folds never saw, the same rare levels we had tried to fold into “Other” in the count model earlier. So we used our custom function to find the MCE, which shows that roughly 10.8% of predictions are mistaken when we apply our model.

Then we built the confusion matrix with the table function, comparing the observed and predicted values of whether a game was free. From the confusion matrix we calculated a true positive rate of 0.9801 and a false positive rate of 0.7085.
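As a check, both rates follow directly from the confusion matrix printed above:

```r
# Confusion matrix from the output above: rows = observed, cols = predicted
cm <- matrix(c(3463, 1617, 8416, 79656), nrow = 2,
             dimnames = list(observed = c("0", "1"), predicted = c("0", "1")))

tpr <- cm["1", "1"] / sum(cm["1", ])  # 79656 / (1617 + 79656)
fpr <- cm["0", "1"] / sum(cm["0", ])  # 8416 / (3463 + 8416)
round(c(TPR = tpr, FPR = fpr), 4)     # TPR 0.9801, FPR 0.7085
```

The high false positive rate reflects the class imbalance: most games in the data are free, so the model rarely predicts “paid”.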

Model evaluation and validations

## 
## Call:
## lm(formula = `Average User Rating` ~ Size, data = clean_games)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0951 -0.5367  0.3238  0.4550  0.9650 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.035e+00  1.006e-02 401.017  < 2e-16 ***
## Size        1.797e-10  3.383e-11   5.312 1.11e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7491 on 7486 degrees of freedom
## Multiple R-squared:  0.003756,   Adjusted R-squared:  0.003623 
## F-statistic: 28.22 on 1 and 7486 DF,  p-value: 1.113e-07
## [1] 0.7491455
## [1] 0.1844232
## [1] 0.003755735

The RSE of our model, Average User Rating ≈ β0 + β1 × Size, is 0.7491, and the percentage of prediction error is 18.4%. Only about 0.376% of the variability in Average User Rating is explained by a linear regression on Size. The F-statistic is much greater than 1 (28.22), so we can assume there is a relationship between Size and Average User Rating. By separating Size from the other predictors, we can calculate its RSE, adjusted R², and related statistics. As we can see, the numbers are very close to what we had before, except that the F-statistic is much higher, which reinforces the idea that Size and Average User Rating do have a statistically significant relationship.

## 
## Call:
## lm(formula = `Average User Rating` ~ Size, data = clean_games)
## 
## Coefficients:
## (Intercept)         Size  
##   4.035e+00    1.797e-10
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The residual plot suggests that there is some non-linearity in the data.
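For reference, the three printed quantities (RSE, percent prediction error, R²) can all be recovered from a fitted lm. The data frame below is a toy stand-in, since clean_games is not reproduced here; percent prediction error is the RSE divided by the mean response.

```r
set.seed(42)
# Toy stand-in for clean_games (hypothetical values)
toy <- data.frame(Size = runif(100, 1e7, 5e8))
toy$rating <- 4 + 2e-10 * toy$Size + rnorm(100, sd = 0.75)

fit <- lm(rating ~ Size, data = toy)
s   <- summary(fit)

rse     <- s$sigma                 # residual standard error (0.7491 in the report)
pct.err <- rse / mean(toy$rating)  # percent prediction error (0.1844 in the report)
r2      <- s$r.squared             # variance explained (0.00376 in the report)
```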

—————————————————————–

15. What primary genre is similar to the “Games” genre?

##               avg.user.rating user.rating.count      price      size
## Book                 4.300000           57.6000  0.0000000  52586701
## Business             3.000000           16.5000  4.9950000 136453120
## Education            4.152174          124.8913  2.1041304 113516531
## Entertainment        3.831522          171.6087  0.2161957  76567793
## Finance              4.062500         7725.3750 17.4987500  84826496
## Food & Drink         5.000000            7.0000  0.0000000 106633216
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
##   avg.user.rating_mean user.rating.count_mean    price_mean     size_mean
## 1         3.667553e-16          -1.100465e-17 -4.228324e-17 -4.297571e-17
##   avg.user.rating_sd user.rating.count_sd price_sd size_sd
## 1                  1                    1        1       1

The variable “Primary Genre” is used because it is the key genre that each game is associated with. After grouping by genre, we took the mean of each numerical variable: Average User Rating, User Rating Count, Price, and Size. Averaging is necessary because each genre contains many games, and it produces one row per genre to scale. Scaling before cluster analysis is important so that variables with large ranges, such as Size, do not dominate the distance calculations.
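The grouping-and-scaling step can be sketched as below; the data frame is a toy stand-in for the real per-genre summary, and the final checks mirror the mean-0/sd-1 output printed above.

```r
set.seed(7)
# Toy stand-in for the games data (hypothetical values)
toy <- data.frame(
  genre  = rep(c("Games", "Education", "Finance", "Music"), each = 3),
  rating = runif(12, 1, 5),
  price  = runif(12, 0, 10)
)

# One row per genre: the mean of each numeric variable
genre.means <- aggregate(cbind(rating, price) ~ genre, data = toy, FUN = mean)

# Standardize so each column has mean 0 and sd 1 before clustering
scaled <- scale(genre.means[, c("rating", "price")])
colMeans(scaled)      # each column now has mean ~0
apply(scaled, 2, sd)  # ...and sd 1
```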

##         kcluster
##  [1,] 80.0000000
##  [2,] 53.3496139
##  [3,] 36.2260487
##  [4,] 26.4420736
##  [5,] 16.3576898
##  [6,] 10.7686989
##  [7,]  8.4816453
##  [8,]  6.4346684
##  [9,]  4.5511644
## [10,]  3.2860759
## [11,]  2.5383748
## [12,]  1.8744334
## [13,]  1.4143411
## [14,]  1.1715535
## [15,]  0.8521516

We ran a for loop to find the best number of clusters, using nstart = 20 so that each k-means fit is rerun from enough random starts to be stable. Plotting the elbow curve, we find that 6 is the best number of clusters because the within-cluster sum of squares no longer decreases significantly beyond k = 6.
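The elbow search can be sketched as follows, with a random matrix standing in for the scaled genre data:

```r
set.seed(20)
scaled <- matrix(rnorm(80 * 4), ncol = 4)  # stand-in for the scaled genre data

# Total within-cluster sum of squares for k = 1..15, 20 random restarts each
wss <- sapply(1:15, function(k) {
  kmeans(scaled, centers = k, nstart = 20)$tot.withinss
})

plot(1:15, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

The chosen k is where the curve bends: adding clusters past that point buys only a small drop in within-cluster sum of squares.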

## Joining, by = "label"

When we plotted the dendrogram, we found that the primary genre “Games” is not associated with any other primary genres. There are no primary genres that are similar to “Games”.
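A minimal sketch of the hierarchical clustering behind such a dendrogram, again on stand-in data:

```r
set.seed(1)
# Stand-in for the scaled per-genre matrix (hypothetical values)
scaled <- matrix(rnorm(20 * 4), ncol = 4,
                 dimnames = list(paste0("genre", 1:20), NULL))

hc <- hclust(dist(scaled))  # Euclidean distance, complete linkage (defaults)
plot(hc)                    # a genre that merges last sits on its own branch
```

A genre like “Games” appearing alone on a late-merging branch is what indicates it is dissimilar to every other primary genre.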

## # A tibble: 10 x 4
##    Name                   `Primary Genre` `Average User Ratin~ `User Rating Cou~
##    <chr>                  <chr>                          <dbl>             <dbl>
##  1 "Clash of Clans"       Games                            4.5           3032734
##  2 "Clash Royale"         Games                            4.5           1277095
##  3 "PUBG MOBILE"          Games                            4.5            711409
##  4 "Plants vs. Zombies\\~ Games                            4.5            469562
##  5 "Pok\\xe9mon GO"       Games                            3.5            439776
##  6 "Boom Beach"           Games                            4.5            400787
##  7 "Cash, Inc. Fame & Fo~ Games                            5              374772
##  8 "Idle Miner Tycoon: C~ Games                            4.5            283035
##  9 "TapDefense"           Games                            3.5            273687
## 10 "Star Wars\\u2122: Co~ Games                            4.5            259030

This table shows that the top 10 popular games all tagged the genre “Games” as their primary genre.

—————————————————————–

Conclusions and discussion

Conclusion on what contributes to a game’s success and relationship between initial price and average user rating:

From the analysis, we can conclude that Size is our best predictor of Average User Rating. It had the strongest relationship, and the plot with fitted linear regression lines showed that as the size of a game increases, its rating also increases, which contributes to a game’s overall success. Initial price does not dictate average user rating as much as we expected: most price points share the same average rating (around 4.5/5) until the price rises above 10 dollars, and even those games average around 4/5.

Using the logistic regression models, we concluded that the best model for predicting whether an application will be free is the sig.or (or the nearly identical clean.fw) model. With it, we correctly predicted 79656 true positives and 3463 true negatives, against 1617 false negatives and 8416 false positives.


Conclusion for similarity of primary genres:

We can conclude that the primary genre “Games” is a very distinct genre. For a game to be recognized in the Apple App Store, its genre needs to be marked as “Games”. Since many of the popular games also tagged “Games” as their primary genre, we can assume that the “Games” genre is one of the many important factors that contributes to a popular strategy game.

—————————————————————–

Authors’ contributions

Alexandria Richardson

  • Introduction and Project description
  • Does genre of games cause people to spend more money?
  • Is there a relationship between user rating and in-app purchases?
  • Does the amount of available in-app purchases decrease rating?
  • Which genre of game does better internationally?
  • Which words are most commonly used in Description of games?
  • What is the average price of in-app purchases?

Daniel Han

  • What contributes to a game’s success?
  • What is the relationship between initial price of apps and average user rating?
  • Conclusion on what contributes to a game’s success and relationship between initial price and average user rating.

Kristoffer Hernandez

  • Which words are most commonly used in Description of games?
  • Which genre is most popular?
  • What are the top languages in which games are offered?
  • What is the distribution of user rating across genre?
  • How has the size of the applications of the top 3 primary genres changed over the span of about 11 years?
  • Can we predict if an app is free or not?

Goldie Starla

  • What information can we find about game developers and their strategy games?
  • What is the frequency of the age groups?
  • What primary genre is similar to the “Games” genre?
  • Conclusion for similarity of primary genres

—————————————————————–

References

Tristan. “17K Mobile Strategy Games.” Kaggle, 26 Aug. 2019, www.kaggle.com/tristan581/17k-apple-app-store-strategy-games.